Abstract
There is evidence in neuroscience indicating that prediction of spatial and temporal patterns in the brain plays a key role in perception. This has given rise to prediction-based fusion as a method of combining information from audio and visual modalities. Models are trained on a per-class basis to learn the mapping from one feature-space to another. When presented with unseen data, each model predicts the respective feature-sets using its learnt mapping, and the prediction errors are combined within each class. The model which best describes the audiovisual relationship (by having the lowest combined prediction error) provides its label to the input data. Previous studies have only used neural networks to evaluate this method of combining modalities; this paper extends the evaluation to other learning methods, including Long Short-Term Memory recurrent neural networks (LSTMs), Support Vector Machines (SVMs), Relevance Vector Machines (RVMs), and Gaussian Processes (GPs). Our results on cross-database experiments on nonlinguistic vocalisation recognition show that feature-prediction significantly outperforms feature-fusion for neural networks, LSTMs, and GPs, while performance on SVMs and RVMs is more ambiguous and neither fusion approach gains an absolute advantage over the other.
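The classification rule described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: ridge regression stands in for the learners evaluated in the paper, the features are synthetic, the two directional errors are combined by a plain sum of mean squared errors, and all function names (`fit_ridge`, `train_class_models`, `classify`, `sample`) are hypothetical.

```python
import numpy as np

def _with_bias(X):
    # Append a constant column so the linear map can learn an offset.
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_ridge(X, Y, lam=1e-3):
    # Closed-form ridge regression X -> Y (assumed stand-in for NN/LSTM/SVM/RVM/GP).
    Xb = _with_bias(X)
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def predict(W, X):
    return _with_bias(X) @ W

def train_class_models(audio, video, labels):
    # One pair of predictors per class: audio -> video and video -> audio.
    models = {}
    for c in np.unique(labels):
        a, v = audio[labels == c], video[labels == c]
        models[int(c)] = (fit_ridge(a, v), fit_ridge(v, a))
    return models

def classify(models, audio_seq, video_seq):
    # Combine both prediction errors per class (summing is an assumption)
    # and return the label of the class with the lowest combined error.
    errors = {}
    for c, (w_av, w_va) in models.items():
        err_av = np.mean((predict(w_av, audio_seq) - video_seq) ** 2)
        err_va = np.mean((predict(w_va, video_seq) - audio_seq) ** 2)
        errors[c] = err_av + err_va
    return min(errors, key=errors.get)

# Synthetic data: each class has its own audio-to-video relationship.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 3))
maps = {0: M, 1: -M}

def sample(c, n):
    # n frames of 4-dim "audio" and 3-dim "video" features for class c.
    a = rng.normal(size=(n, 4))
    v = a @ maps[c] + 0.01 * rng.normal(size=(n, 3))
    return a, v

train = [sample(c, 200) for c in (0, 1)]
audio = np.vstack([a for a, _ in train])
video = np.vstack([v for _, v in train])
labels = np.repeat([0, 1], 200)
models = train_class_models(audio, video, labels)

pred0 = classify(models, *sample(0, 20))  # unseen sequence drawn from class 0
pred1 = classify(models, *sample(1, 20))  # unseen sequence drawn from class 1
print(pred0, pred1)
```

In this sketch, feature-fusion would instead concatenate the audio and video features and train a single classifier; the prediction-based variant never sees the label at test time and relies only on which class model reconstructs the cross-modal relationship best.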
Original language | Undefined |
---|---|
Title of host publication | Proceedings of the 20th ACM International Conference on Multimedia, MM 2012 |
Place of Publication | New York |
Publisher | Association for Computing Machinery |
Pages | 813-816 |
Number of pages | 4 |
ISBN (Print) | 978-1-4503-1089-5 |
Publication status | Published - 29 Oct 2012 |
Event | 20th ACM Multimedia Conference, MM 2012, Nara, Japan. Duration: 29 Oct 2012 → 2 Nov 2012. Conference number: 20 |
Publication series
Name | |
---|---|
Publisher | ACM |
Conference
Conference | 20th ACM Multimedia Conference, MM 2012 |
---|---|
Abbreviated title | MM |
Country/Territory | Japan |
City | Nara |
Period | 29/10/12 → 2/11/12 |
Keywords
- HMI-MI: MULTIMODAL INTERACTIONS
- EWI-22891
- IR-84312
- METIS-296222
- Nonlinguistic Information Processing
- Prediction-based Classification/Fusion
- Audio-visual Fusion