Abstract
Recent evidence in neuroscience supports the theory that prediction of spatial and temporal patterns in the brain plays a key role in human action and perception. Inspired by these findings, a system that discriminates laughter from speech by modeling the spatial and temporal relationship between audio and visual features is presented. The underlying assumption is that this relationship differs between speech and laughter. Neural networks are trained to learn the audio-to-visual and visual-to-audio feature mappings, together with the temporal evolution of the audio and visual features, for both classes. Classification of a new frame/sequence is performed via prediction: all networks produce a prediction of the expected audio/visual features, and their prediction errors are combined for each class. The model that best describes the audiovisual feature relationship, i.e., the one yielding the lowest prediction error, provides its label to the input frame/sequence. Using four different datasets, the proposed system is compared to standard feature-level fusion in cross-database experiments. In almost all test cases, prediction-based classification outperforms feature-level fusion. Similar conclusions are drawn when artificial feature-level noise is added to the datasets.
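To make the pipeline concrete, below is a minimal PyTorch sketch of the prediction-based classification idea described in the abstract. The per-class networks for the audio-to-visual mapping, the visual-to-audio mapping, and the temporal evolution of each modality follow the abstract; the feature dimensions, network sizes, and the plain-MSE combination of prediction errors are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of prediction-based laughter-vs-speech classification.
# Feature sizes, network shapes, and MSE error fusion are assumptions.
import torch
import torch.nn as nn

AUDIO_DIM, VISUAL_DIM = 13, 20  # assumed feature sizes (e.g. MFCCs, shape points)

def mlp(d_in, d_out, hidden=32):
    return nn.Sequential(nn.Linear(d_in, hidden), nn.Tanh(), nn.Linear(hidden, d_out))

class ClassModel(nn.Module):
    """Per-class predictors: cross-modal mappings plus temporal evolution."""
    def __init__(self):
        super().__init__()
        self.a2v = mlp(AUDIO_DIM, VISUAL_DIM)      # audio-to-visual mapping
        self.v2a = mlp(VISUAL_DIM, AUDIO_DIM)      # visual-to-audio mapping
        self.a_next = mlp(AUDIO_DIM, AUDIO_DIM)    # audio(t) -> audio(t+1)
        self.v_next = mlp(VISUAL_DIM, VISUAL_DIM)  # visual(t) -> visual(t+1)

    def prediction_error(self, a, v):
        # a: (T, AUDIO_DIM), v: (T, VISUAL_DIM); combine the four prediction errors.
        e = nn.functional.mse_loss(self.a2v(a), v)
        e = e + nn.functional.mse_loss(self.v2a(v), a)
        e = e + nn.functional.mse_loss(self.a_next(a[:-1]), a[1:])
        e = e + nn.functional.mse_loss(self.v_next(v[:-1]), v[1:])
        return e

models = {"laughter": ClassModel(), "speech": ClassModel()}

def train(model, sequences, epochs=50):
    # Each model sees only sequences of its own class.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for a, v in sequences:
            opt.zero_grad()
            loss = model.prediction_error(a, v)
            loss.backward()
            opt.step()

def classify(a, v):
    # The class whose networks best predict the observed features wins.
    with torch.no_grad():
        errors = {c: m.prediction_error(a, v).item() for c, m in models.items()}
    return min(errors, key=errors.get)
```

As in the abstract, each class's networks are trained only on examples of that class, so at test time the lowest combined prediction error identifies the class whose learned audiovisual relationship best fits the input frame/sequence.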
Original language | Undefined
---|---
Title of host publication | IEEE International Conference on Automatic Face & Gesture Recognition and Workshops (FG 2011)
Place of Publication | USA
Publisher | IEEE Computer Society
Pages | 619-626
Number of pages | 8
ISBN (Print) | 978-1-4244-9140-7
Publication status | Published - Mar 2011
Event | 9th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2011, Santa Barbara, United States, 21 Mar 2011 → 25 Mar 2011 (Conference number: 9)
Conference
Conference | 9th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2011 |
---|---
Abbreviated title | FG |
Country/Territory | United States |
City | Santa Barbara |
Period | 21/03/11 → 25/03/11 |
Keywords
- METIS-285043
- IR-79506
- Artificial Neural Networks
- Feature extraction
- Hidden Markov models
- Predictive models
- Speech
- Training
- Visualization
- HMI-MI: MULTIMODAL INTERACTIONS
- EWI-21351