TY - GEN
T1 - Classifying laughter and speech using audio-visual feature prediction
AU - Petridis, Stavros
AU - Asghar, Ali
AU - Pantic, Maja
PY - 2010/3/17
Y1 - 2010/3/17
N2 - In this study, a system that discriminates laughter from speech by modelling the relationship between audio and visual features is presented. The underlying assumption is that this relationship differs between speech and laughter. Neural networks are trained to learn the audio-to-visual and visual-to-audio feature mappings for both classes. Classification of a new frame is performed via prediction: all the networks produce a prediction of the expected audio/visual features, and the network with the best prediction, i.e., the model which best describes the audiovisual feature relationship, assigns its label to the input frame. When trained on a simple dataset and tested on a hard dataset, the proposed approach outperforms audiovisual feature-level fusion, yielding an absolute increase of 10.9% in the F1 measure for laughter and 6.4% in the classification rate. This indicates that prediction-based classification can produce a good model even when the available training dataset is not challenging enough.
KW - IR-75891
KW - METIS-275890
KW - Audiovisual speech/laughter feature relationship
KW - EWI-19477
KW - laughter-vs-speech discrimination
KW - HMI-MI: MULTIMODAL INTERACTIONS
KW - prediction-based classification
DO - 10.1109/ICASSP.2010.5494992
M3 - Conference contribution
SN - 978-1-4244-4295-9
SP - 5254
EP - 5257
BT - Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2010)
PB - IEEE
CY - USA
T2 - IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010
Y2 - 14 March 2010 through 19 March 2010
ER -