Classifying laughter and speech using audio-visual feature prediction

Stavros Petridis, Ali Asghar, Maja Pantic

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

    9 Citations (Scopus)
    70 Downloads (Pure)

    Abstract

    In this study, a system that discriminates laughter from speech by modelling the relationship between audio and visual features is presented. The underlying assumption is that this relationship differs between speech and laughter. Neural networks are trained to learn the audio-to-visual and visual-to-audio feature mappings for both classes. Classification of a new frame is performed via prediction: all the networks produce a prediction of the expected audio/visual features, and the network with the best prediction, i.e., the model which best describes the audiovisual feature relationship, provides its label to the input frame. When trained on a simple dataset and tested on a hard dataset, the proposed approach outperforms audiovisual feature-level fusion, resulting in a 10.9% absolute increase in the F1 rate for laughter and a 6.4% absolute increase in the classification rate. This indicates that prediction-based classification can produce a good model even when the available training dataset is not challenging enough.
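
    The prediction-based scheme described in the abstract can be sketched as follows. This is a minimal illustration only: the per-class MLP regressors (scikit-learn's MLPRegressor), the feature shapes, and the summed mean-squared prediction error are assumptions for the sketch, not the features, network configuration, or error measure actually used in the paper.

        # Minimal sketch of prediction-based laughter-vs-speech classification.
        # All names, dimensions, and the choice of MLPRegressor are illustrative
        # assumptions; they are not the configuration reported in the paper.
        import numpy as np
        from sklearn.neural_network import MLPRegressor

        class PredictionBasedClassifier:
            """Learns audio-to-visual and visual-to-audio mappings per class;
            a frame is labelled by the class whose networks predict it best."""

            def __init__(self, classes=("laughter", "speech"), hidden=(16,)):
                self.classes = classes
                self.models = {
                    c: {"a2v": MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000),
                        "v2a": MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000)}
                    for c in classes
                }

            def fit(self, audio, visual, labels):
                # audio: (n_frames, d_audio), visual: (n_frames, d_visual)
                labels = np.asarray(labels)
                for c in self.classes:
                    idx = labels == c
                    self.models[c]["a2v"].fit(audio[idx], visual[idx])  # audio -> visual
                    self.models[c]["v2a"].fit(visual[idx], audio[idx])  # visual -> audio
                return self

            def predict(self, audio, visual):
                errors = []
                for c in self.classes:
                    v_hat = self.models[c]["a2v"].predict(audio)
                    a_hat = self.models[c]["v2a"].predict(visual)
                    # Combined mean-squared prediction error in both directions
                    err = np.mean((v_hat - visual) ** 2, axis=1) + \
                          np.mean((a_hat - audio) ** 2, axis=1)
                    errors.append(err)
                # Each frame gets the label of the best-predicting model pair
                return np.array(self.classes)[np.argmin(np.stack(errors), axis=0)]

    Per the abstract, two mappings are trained per class (four networks in total here), and the class whose pair yields the lowest prediction error for a frame provides that frame's label.
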
    Original language: Undefined
    Title of host publication: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2010)
    Place of Publication: USA
    Publisher: IEEE Computer Society
    Pages: 5254-5257
    Number of pages: 4
    ISBN (Print): 978-1-4244-4295-9
    DOI: 10.1109/ICASSP.2010.5494992
    Publication status: Published - 17 Mar 2010
    Event: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010 - Dallas, United States
    Duration: 14 Mar 2010 - 19 Mar 2010

    Publication series

    Publisher: IEEE Computer Society
    ISSN (Print): 1520-6149

    Conference

    Conference: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010
    Abbreviated title: ICASSP
    Country: United States
    City: Dallas
    Period: 14/03/10 - 19/03/10

    Keywords

    • IR-75891
    • METIS-275890
    • Audiovisual speech/laughter feature relationship
    • EWI-19477
    • laughter-vs-speech discrimination
    • HMI-MI: MULTIMODAL INTERACTIONS
    • prediction-based classification

    Cite this

    Petridis, S., Asghar, A., & Pantic, M. (2010). Classifying laughter and speech using audio-visual feature prediction. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2010) (pp. 5254-5257). USA: IEEE Computer Society. https://doi.org/10.1109/ICASSP.2010.5494992