Audiovisual classification of vocal outbursts in human conversation using long-short-term memory networks

Florian Eyben, Stavros Petridis, Björn Schuller, Georgios Tzimiropoulos, Stefanos Zafeiriou, Maja Pantic

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Academic; peer-review

    15 Citations (Scopus)

    Abstract

    We investigate classification of non-linguistic vocalisations with a novel audiovisual approach and Long Short-Term Memory (LSTM) Recurrent Neural Networks as highly successful dynamic sequence classifiers. The evaluation database is the Audiovisual Interest Corpus of natural human-to-human conversation, as used in this year's Paralinguistic Challenge. For video-based analysis we compare shape-based and appearance-based features. These are fused in an early manner with typical audio descriptors. The results show significant improvements of LSTM networks over a static approach based on Support Vector Machines. More importantly, we show a significant gain in performance when fusing audio and visual shape features.
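
    The early fusion mentioned in the abstract can be illustrated with a minimal sketch. It assumes that early fusion here means frame-wise concatenation of the audio and visual feature vectors before classification, which is the usual meaning of the term; the abstract does not give the feature dimensions or frame alignment, so the function name `early_fusion` and all values below are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def early_fusion(
    audio_frames: Sequence[Sequence[float]],
    visual_frames: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate audio and visual feature vectors frame by frame.

    Assumes both streams are already aligned to the same frame rate,
    so frame i of one stream corresponds to frame i of the other.
    """
    if len(audio_frames) != len(visual_frames):
        raise ValueError("audio and visual streams must have the same number of frames")
    # One fused vector per frame: audio descriptors first, then visual features.
    return [list(a) + list(v) for a, v in zip(audio_frames, visual_frames)]


# Hypothetical toy input: 3 frames, 2 audio dimensions, 3 visual dimensions.
audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
visual = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

fused = early_fusion(audio, visual)
print(len(fused), len(fused[0]))
```

    The fused sequence keeps one vector per frame, so it can be passed to a sequence classifier such as an LSTM network in the same way as a single-modality feature stream.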
    Original language: Undefined
    Title of host publication: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011)
    Place of Publication: USA
    Publisher: IEEE Signal Processing Society
    Pages: 5844-5847
    Number of pages: 4
    ISBN (Print): 978-1-4577-0538-0
    DOIs: https://doi.org/10.1109/ICASSP.2011.5947690
    Publication status: Published - May 2011
    Event: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2011 - Prague, Czech Republic
    Duration: 22 May 2011 - 27 May 2011

    Publication series

    Name
    Publisher: IEEE Signal Processing Society
    ISSN (Print): 1520-6149

    Conference

    Conference: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2011
    Abbreviated title: ICASSP
    Country: Czech Republic
    City: Prague
    Period: 22/05/11 - 27/05/11

    Keywords

    • METIS-285044
    • IR-79507
    • Audio signal processing
    • Support Vector Machines
    • HMI-MI: MULTIMODAL INTERACTIONS
    • EC Grant Agreement nr.: FP7/211486
    • video signal processing
    • recurrent neural nets
    • audio-visual systems
    • EWI-21353

    Cite this

    Eyben, F., Petridis, S., Schuller, B., Tzimiropoulos, G., Zafeiriou, S., & Pantic, M. (2011). Audiovisual classification of vocal outbursts in human conversation using long-short-term memory networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011) (pp. 5844-5847). USA: IEEE Signal Processing Society. https://doi.org/10.1109/ICASSP.2011.5947690