Comparison of Single-model and Multiple-model Prediction-based Audiovisual Fusion

Stavros Petridis, Varun Rajgarhia, Maja Pantic

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Prediction-based fusion is a recently proposed audiovisual fusion approach that outperforms feature-level fusion on laughter-vs-speech discrimination. One set of predictive models is trained per class, learning the audio-to-visual and visual-to-audio feature mappings together with the time evolution of the audio and visual features. Classification of a new input is performed via prediction: the predictors of every class produce a prediction of the expected audio/visual features, and their prediction errors are combined per class. The model that best describes the audiovisual feature relationship, i.e., yields the lowest prediction error, provides its label to the input. In all previous work, a single set of predictors was trained on the entire training set for each class. In this work, we investigate the use of multiple sets of predictors per class. The main idea is that models trained on clusters of the data become more specialised and therefore produce lower prediction errors, which can in turn enhance classification performance. We experimented with subject-based clustering and with clustering based on laughter type, voiced versus unvoiced. Results are presented for laughter-vs-speech discrimination in a cross-database experiment using the AMI and MAHNOB databases. The use of multiple sets of models yields a significant performance increase, with the laughter-type clustering achieving the best performance. Overall, an increase of over 4% and 10% is observed in the F1 measure for speech and laughter, respectively, on both datasets.
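To make the classification-by-prediction idea concrete, the following is a minimal Python sketch of the decision rule described above. It assumes sklearn-style regressors for the cross-modal mappings; the PredictorSet class, the summed mean-squared-error combination, and the min-over-clusters rule are illustrative assumptions, not the authors' exact formulation.

# Minimal sketch of prediction-based classification with multiple predictor
# sets per class. Names and the error-combination scheme are illustrative
# assumptions, not the authors' actual implementation.

import numpy as np


class PredictorSet:
    """One set of cross-modal predictors, trained on one cluster of one class.

    a2v_model and v2a_model are assumed to be fitted regressors (e.g.
    sklearn-style objects exposing predict()) for the audio-to-visual and
    visual-to-audio feature mappings.
    """

    def __init__(self, a2v_model, v2a_model):
        self.a2v_model = a2v_model  # predicts visual features from audio
        self.v2a_model = v2a_model  # predicts audio features from visual

    def prediction_error(self, audio_feats, visual_feats):
        # Combine the two cross-modal prediction errors; a simple sum of
        # mean squared errors is assumed here.
        visual_hat = self.a2v_model.predict(audio_feats)
        audio_hat = self.v2a_model.predict(visual_feats)
        err_v = np.mean((visual_feats - visual_hat) ** 2)
        err_a = np.mean((audio_feats - audio_hat) ** 2)
        return err_v + err_a


def classify(audio_feats, visual_feats, predictor_sets_per_class):
    """Label the input with the class whose best-fitting predictor set
    yields the lowest prediction error.

    predictor_sets_per_class: dict mapping a class label (e.g. 'laughter',
    'speech') to a list of PredictorSet objects, one per data cluster
    (one entry per class reproduces the single-model setting).
    """
    best_label, best_err = None, np.inf
    for label, predictor_sets in predictor_sets_per_class.items():
        # With multiple sets per class, take the lowest error over the
        # cluster-specific sets (one plausible combination rule).
        err = min(ps.prediction_error(audio_feats, visual_feats)
                  for ps in predictor_sets)
        if err < best_err:
            best_label, best_err = label, err
    return best_label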
Language: Undefined
Title of host publication: Proceedings of the 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing, FAAVSP 2015
Place of Publication: Baixas, France
Publisher: ISCA Speech Organisation
Pages: 109-114
Number of pages: 6
ISBN (Print): not assigned
State: Published - Sep 2015

Publication series

Publisher: ISCA Speech Organisation

Keywords

  • EC Grant Agreement nr.: FP7/611153
  • EC Grant Agreement nr.: FP7/2007-2013
  • EWI-26780
  • HMI-HF: Human Factors
  • IR-99458
  • Nonlinguistic Information Processing
  • Prediction-based Fusion
  • METIS-316030
  • Audio-visual Fusion

Cite this

Petridis, S., Rajgarhia, V., & Pantic, M. (2015). Comparison of Single-model and Multiple-model Prediction-based Audiovisual Fusion. In Proceedings of the 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing, FAAVSP 2015 (pp. 109-114). Baixas, France: ISCA Speech Organisation.