Spotting social signals in conversational speech over IP: A deep learning perspective

Raymond Brueckner, Maximilian Schmitt, Maja Pantic, Björn Schuller

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review


    Abstract

    The automatic detection and classification of social signals is an important task, given the fundamental role nonverbal behavioral cues play in human communication. We present the first cross-lingual study on the detection of laughter and fillers in conversational and spontaneous speech collected 'in the wild' over IP (Internet Protocol). Further, this is the first comparison of LSTM and GRU networks to shed light on their performance differences. We report frame-based results in terms of the unweighted average area under the curve (UAAUC) measure and briefly discuss its suitability for this task. In the monolingual setup, our best deep BLSTM system achieves 87.0% and 86.3% UAAUC for English and German, respectively. Interestingly, the cross-lingual results are only slightly lower, yielding 83.7% for a system trained on English but tested on German, and 85.0% in the opposite case. We show that LSTM and GRU architectures are valid alternatives for, e.g., online and compute-sensitive applications, since they incur a relative UAAUC decrease of only approximately 5% with respect to our best systems. Finally, we smooth the posterior trajectories to correct erroneous spikes and drops, which yields an additional gain in all setups.
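    The abstract reports frame-based UAAUC and mentions smoothing of posterior trajectories without detailing either. The Python sketch below illustrates one common reading under stated assumptions: UAAUC as the unweighted mean of one-vs-rest ROC AUCs over the target classes (assumed here to be laughter and filler, with any other frame label acting only as a negative), and smoothing as a centred median filter. The function names, the default class set and the window length are hypothetical illustrations, not the paper's implementation.

```python
from statistics import median
from typing import Dict, List, Sequence


def binary_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as the Mann-Whitney statistic: the fraction of (positive, negative)
    frame pairs in which the positive frame scores higher; ties count one half.
    Quadratic in the number of frames, which is acceptable for a sketch."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative frame")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def uaauc(posteriors: Dict[str, Sequence[float]],
          frame_labels: Sequence[str],
          classes: Sequence[str] = ("laughter", "filler")) -> float:
    """Unweighted average of one-vs-rest AUCs over the target classes.
    Assumption: the class set defaults to laughter and filler; frames with
    any other label (e.g. 'garbage') only ever act as negatives."""
    aucs = []
    for c in classes:
        labels = [1 if lab == c else 0 for lab in frame_labels]
        aucs.append(binary_auc(posteriors[c], labels))
    return sum(aucs) / len(aucs)


def median_smooth(trajectory: Sequence[float], window: int = 5) -> List[float]:
    """Centred median filter over one posterior trajectory (hypothetical choice
    of filter and window). The window is truncated at the sequence edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(trajectory)
    return [float(median(trajectory[max(0, t - half):min(n, t + half + 1)]))
            for t in range(n)]


if __name__ == "__main__":
    labels = ["laughter", "laughter", "filler", "garbage"]
    post = {"laughter": [0.9, 0.8, 0.2, 0.1],
            "filler": [0.1, 0.4, 0.3, 0.35]}
    print(uaauc(post, labels))  # laughter AUC 1.0, filler AUC 1/3 -> about 0.667
    print(median_smooth([0.0, 0.0, 0.9, 0.0, 0.0], window=3))  # spike removed
```

    A median filter is used here because it suppresses isolated spikes and drops a few frames long while preserving the edges of longer events better than a moving average; the abstract does not state which smoother the authors actually apply.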

    Original language: English
    Title of host publication: Proceedings Interspeech 2017
    Pages: 2371-2375
    Number of pages: 5
    Publication status: Published - 2017
    Event: 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017: Situated interaction - Stockholm, Sweden
    Duration: 20 Aug 2017 - 24 Aug 2017
    Conference number: 18
    http://www.interspeech2017.org/

    Conference

    Conference: 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017
    Abbreviated title: INTERSPEECH
    Country/Territory: Sweden
    City: Stockholm
    Period: 20/08/17 - 24/08/17

    Keywords

    • Computational paralinguistics
    • Cross-lingual
    • Deep neural networks
    • GRU
    • LSTM
    • Social signal classification
