Comparison of Single-model and Multiple-model Prediction-based Audiovisual Fusion

Stavros Petridis, Varun Rajgarhia, Maja Pantic

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review



    Prediction-based fusion is a recently proposed audiovisual fusion approach that outperforms feature-level fusion on laughter-vs-speech discrimination. One set of predictive models is trained per class; these models learn the audio-to-visual and visual-to-audio feature mappings together with the time evolution of the audio and visual features. A new input is classified via prediction: the predictors of every class produce a prediction of the expected audio and visual features, and the prediction errors are combined for each class. The class whose models best describe the audiovisual feature relationship, i.e., yield the lowest prediction error, provides its label to the input. In all previous work, a single set of predictors was trained per class on the entire training set. In this work, we investigate the use of multiple sets of predictors per class. The main idea is that models trained on clusters of data will be more specialised and will produce lower prediction errors, which can in turn enhance classification performance. We experimented with subject-based clustering and with clustering based on the type of laughter, voiced or unvoiced. Results are presented for laughter-vs-speech discrimination in a cross-database experiment using the AMI and MAHNOB databases. The use of multiple sets of models results in a significant performance increase, with the latter clustering approach achieving the best performance. Overall, an increase of over 4% and 10% is observed in the F1 measure for speech and laughter, respectively, on both datasets.
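The classification rule described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the predictors are plain ridge-regularised linear maps, the time-evolution models are omitted, the two cross-modal errors are combined by summation, and a class with several predictor sets is scored by the lowest error among its sets. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_linear_map(X, Y, ridge=1e-3):
    """Ridge-regularised least-squares map from X to Y, with a bias term."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def predict(W, X):
    """Apply a map fitted by fit_linear_map to the rows of X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W

def fit_predictor_set(audio, visual):
    """One predictor set: audio-to-visual and visual-to-audio mappings."""
    return {"a2v": fit_linear_map(audio, visual),
            "v2a": fit_linear_map(visual, audio)}

def set_error(pset, audio, visual):
    """Combined prediction error of one set (assumption: sum of both MSEs)."""
    err_a2v = np.mean((predict(pset["a2v"], audio) - visual) ** 2)
    err_v2a = np.mean((predict(pset["v2a"], visual) - audio) ** 2)
    return err_a2v + err_v2a

def classify(models, audio, visual):
    """models maps a class label to a list of predictor sets.

    Assumption: a class is scored by the lowest error among its sets,
    and the class with the lowest score provides the label.
    """
    errors = {label: min(set_error(p, audio, visual) for p in psets)
              for label, psets in models.items()}
    return min(errors, key=errors.get), errors

# Synthetic stand-in data: each cluster has its own audio-to-visual relation.
rng = np.random.default_rng(0)

def make_cluster(M, n):
    audio = rng.standard_normal((n, 2))
    visual = audio @ M + 0.05 * rng.standard_normal((n, 2))
    return audio, visual

M_voiced = np.array([[1.0, 0.5], [0.0, 1.0]])
M_unvoiced = np.array([[0.5, 0.0], [0.3, 1.5]])
M_speech = np.array([[-1.0, 0.0], [0.0, -1.0]])

models = {
    # Two predictor sets for laughter (one per cluster), one for speech.
    "laughter": [fit_predictor_set(*make_cluster(M_voiced, 200)),
                 fit_predictor_set(*make_cluster(M_unvoiced, 200))],
    "speech": [fit_predictor_set(*make_cluster(M_speech, 200))],
}

# A new input drawn from the unvoiced-laughter cluster.
test_audio, test_visual = make_cluster(M_unvoiced, 50)
label, errors = classify(models, test_audio, test_visual)
```

In this toy setting the specialised laughter set fitted on the matching cluster yields the lowest error, so the input receives the laughter label. How the paper actually combines errors across multiple predictor sets is not stated in the abstract.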
    Original language: Undefined
    Title of host publication: Proceedings of the 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing, FAAVSP 2015
    Place of Publication: Baixas, France
    Publisher: ISCA Speech Organisation
    Number of pages: 6
    ISBN (Print): not assigned
    Publication status: Published - Sep 2015

    Publication series

    Publisher: ISCA Speech Organisation


    • EC Grant Agreement nr.: FP7/611153
    • EC Grant Agreement nr.: FP7/2007-2013
    • EWI-26780
    • HMI-HF: Human Factors
    • IR-99458
    • Nonlinguistic Information Processing
    • Prediction-based Fusion
    • METIS-316030
    • Audio-visual Fusion
