Prediction-based classification for audiovisual discrimination between laughter and speech

Stavros Petridis, Maja Pantic, Jeffrey F. Cohn

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

    13 Citations (Scopus)

    Abstract

    Recent evidence in neuroscience supports the theory that prediction of spatial and temporal patterns in the brain plays a key role in human actions and perception. Inspired by these findings, a system that discriminates laughter from speech by modeling the spatial and temporal relationship between audio and visual features is presented. The underlying assumption is that this relationship differs between speech and laughter. For each class, neural networks are trained to learn the audio-to-visual and visual-to-audio feature mappings together with the time evolution of the audio and visual features. Classification of a new frame/sequence is performed via prediction: all the networks produce a prediction of the expected audio/visual features, and their prediction errors are combined for each class. The model which best describes the audiovisual feature relationship, i.e., results in the lowest prediction error, provides its label to the input frame/sequence. Using 4 different datasets, the proposed system is compared to standard feature-level fusion in cross-database experiments. In almost all test cases, prediction-based classification outperforms feature-level fusion. Similar conclusions are drawn when artificial feature-level noise is added to the datasets.
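
The classification rule described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: linear least-squares regressors stand in for the neural networks, only the audio-to-visual and visual-to-audio mappings are modeled (the time-evolution predictors are omitted), the two errors are combined by a plain sum, and the data are synthetic. All names (`ClassModel`, `classify`, `make_toy_sequence`, `MIX_SPEECH`, `MIX_LAUGHTER`) are hypothetical.

```python
import numpy as np


def fit_linear(X, Y):
    """Least-squares map from X to Y with a bias term (stand-in for a neural network)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W


def predict(W, X):
    """Apply a map learned by fit_linear to new feature frames."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W


class ClassModel:
    """Cross-modal predictors trained on the data of a single class."""

    def __init__(self, audio, visual):
        self.a2v = fit_linear(audio, visual)  # audio -> visual
        self.v2a = fit_linear(visual, audio)  # visual -> audio

    def error(self, audio, visual):
        # Assumption: the two mean squared prediction errors are simply summed.
        err_visual = np.mean((predict(self.a2v, audio) - visual) ** 2)
        err_audio = np.mean((predict(self.v2a, visual) - audio) ** 2)
        return err_visual + err_audio


def classify(models, audio, visual):
    """Return the label of the class model with the lowest combined prediction error."""
    errors = {label: m.error(audio, visual) for label, m in models.items()}
    return min(errors, key=errors.get)


def make_toy_sequence(rng, mixing, n_frames=200, noise=0.05):
    """Synthetic frames whose audio-visual relationship is set by `mixing`."""
    audio = rng.normal(size=(n_frames, mixing.shape[0]))
    visual = audio @ mixing + noise * rng.normal(size=(n_frames, mixing.shape[1]))
    return audio, visual


# Hypothetical class-specific audio-visual relationships (3 audio, 2 visual features).
MIX_SPEECH = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])
MIX_LAUGHTER = np.array([[-1.0, 0.5], [0.3, -1.0], [0.0, 1.0]])

rng = np.random.default_rng(0)
models = {
    "speech": ClassModel(*make_toy_sequence(rng, MIX_SPEECH)),
    "laughter": ClassModel(*make_toy_sequence(rng, MIX_LAUGHTER)),
}

test_audio, test_visual = make_toy_sequence(rng, MIX_LAUGHTER, n_frames=50)
print(classify(models, test_audio, test_visual))
```

The design point this sketch tries to capture is that no classifier is trained on the joint feature vector: each class is represented only by predictors fitted to its own data, and the decision comes from comparing how well each class's predictors explain the input.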
    Original language: Undefined
    Title of host publication: IEEE International Conference on Automatic Face & Gesture Recognition and Workshops (FG 2011)
    Place of Publication: USA
    Publisher: IEEE Computer Society
    Pages: 619-626
    Number of pages: 8
    ISBN (Print): 978-1-4244-9140-7
    Publication status: Published - Mar 2011
    Event: 9th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2011 - Santa Barbara, United States
    Duration: 21 Mar 2011 → 25 Mar 2011
    Conference number: 9

    Publication series

    Publisher: IEEE Computer Society

    Conference

    Conference: 9th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2011
    Abbreviated title: FG
    Country: United States
    City: Santa Barbara
    Period: 21/03/11 → 25/03/11

    Keywords

    • METIS-285043
    • IR-79506
    • Artificial Neural Networks
    • Feature extraction
    • Hidden Markov models
    • Predictive models
    • Speech
    • Training
    • Visualization
    • HMI-MI: MULTIMODAL INTERACTIONS
    • EWI-21351
