Audiovisual classification of vocal outbursts in human conversation using long-short-term memory networks

Florian Eyben, Stavros Petridis, Björn Schuller, Georgios Tzimiropoulos, Stefanos Zafeiriou, Maja Pantic

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

    15 Citations (Scopus)


    We investigate the classification of non-linguistic vocalisations with a novel audiovisual approach, using Long Short-Term Memory (LSTM) Recurrent Neural Networks as highly successful dynamic sequence classifiers. The evaluation database is this year's Paralinguistic Challenge's Audiovisual Interest Corpus of natural human-to-human conversation. For video-based analysis we compare shape-based and appearance-based features, which are fused at an early stage with typical audio descriptors. The results show significant improvements of LSTM networks over a static approach based on Support Vector Machines. More importantly, we show a significant gain in performance when fusing audio and visual shape features.
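The abstract describes early fusion, i.e. concatenating per-frame audio and visual feature vectors before feeding them to an LSTM sequence classifier. The sketch below illustrates that pipeline in plain NumPy with an untrained LSTM forward pass; all dimensions (feature sizes, hidden units, number of vocalisation classes) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_forward(x_seq, W, U, b):
    """Run a standard LSTM over a sequence and return the final hidden state.
    x_seq: (T, d_in); W: (4*d_h, d_in); U: (4*d_h, d_h); b: (4*d_h,)
    Gate order in the stacked weights: input, forget, cell, output."""
    d_h = U.shape[1]
    h = np.zeros(d_h)
    c = np.zeros(d_h)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x in x_seq:
        z = W @ x + U @ h + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update memory cell
        h = sigmoid(o) * np.tanh(c)                   # emit hidden state
    return h

# Illustrative dimensions (NOT from the paper): 20 frames, 13 audio and
# 8 visual features per frame, 16 hidden units, 4 vocalisation classes.
T, d_audio, d_visual, d_h, n_classes = 20, 13, 8, 16, 4
audio = rng.normal(size=(T, d_audio))    # e.g. MFCC-like frame descriptors
visual = rng.normal(size=(T, d_visual))  # e.g. shape-model parameters

# Early fusion: concatenate the two modalities frame by frame.
fused = np.concatenate([audio, visual], axis=1)  # (T, d_audio + d_visual)

d_in = d_audio + d_visual
W = rng.normal(scale=0.1, size=(4 * d_h, d_in))
U = rng.normal(scale=0.1, size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)

h_final = lstm_forward(fused, W, U, b)

# Softmax read-out over classes (weights untrained; shapes only).
Wy = rng.normal(scale=0.1, size=(n_classes, d_h))
logits = Wy @ h_final
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

In a trained system the weights would be learned by backpropagation through time; the point here is only how the fused feature sequence flows through the recurrent classifier.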
    Original language: Undefined
    Title of host publication: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011)
    Place of publication: USA
    Publisher: IEEE Signal Processing Society
    Number of pages: 4
    ISBN (Print): 978-1-4577-0538-0
    Publication status: Published - May 2011
    Event: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2011 - Prague, Czech Republic
    Duration: 22 May 2011 - 27 May 2011

    Publication series

    Publisher: IEEE Signal Processing Society
    ISSN (Print): 1520-6149


    Conference: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2011
    Abbreviated title: ICASSP
    Country: Czech Republic


    • METIS-285044
    • IR-79507
    • Audio signal processing
    • Support Vector Machines
    • EC Grant Agreement nr.: FP7/211486
    • video signal processing
    • recurrent neural nets
    • audio-visual systems
    • EWI-21353
