Learning spectral-temporal features with 3D CNNs for speech emotion recognition

    Research output: Contribution to conference › Paper › peer-review


    Abstract

    In this paper, we propose to use deep 3-dimensional convolutional networks (3D CNNs) to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of a Convolutional Neural Network and a Long Short-Term Memory network (CNN-LSTM), our proposed 3D CNNs simultaneously extract short-term and long-term spectral features with a moderate number of parameters. We evaluated our proposed method and other state-of-the-art methods in a speaker-independent manner using aggregated corpora that give a large and diverse set of speakers. We found that 1) shallow temporal and moderately deep spectral kernels of a homogeneous architecture are optimal for the task; and 2) our 3D CNNs are more effective for spectro-temporal feature learning than the other methods. Finally, we visualised the feature space obtained with our proposed method using t-distributed stochastic neighbour embedding (t-SNE) and observed distinct clusters of emotions.
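
    The abstract describes a single network applying 3D convolutions jointly over the spectral and temporal axes of stacked spectrogram segments. As a rough, hypothetical illustration of that idea (not the authors' exact architecture), the PyTorch sketch below builds a small 3D CNN over a log-Mel spectrogram volume; the class name, input shape, layer widths, kernel sizes and the four-class output are illustrative assumptions.

    ```python
    # Minimal, hypothetical sketch of a 3D CNN over a spectro-temporal volume.
    # Shapes, layer sizes and the number of emotion classes are assumptions for
    # illustration only, not the configuration reported in the paper.
    import torch
    import torch.nn as nn

    class SpectroTemporal3DCNN(nn.Module):
        def __init__(self, num_emotions: int = 4):
            super().__init__()
            # Input: (batch, 1, segments, mel_bins, frames), i.e. a stack of
            # short log-Mel spectrogram segments covering one utterance.
            self.features = nn.Sequential(
                # Kernel layout: (temporal depth, spectral, frame); the temporal
                # depth is kept small, loosely mirroring the "shallow temporal
                # kernels" finding in the abstract.
                nn.Conv3d(1, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
                nn.BatchNorm3d(16),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool only spectral/frame axes
                nn.Conv3d(16, 32, kernel_size=(3, 3, 3), padding=(1, 1, 1)),
                nn.BatchNorm3d(32),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool3d(1),               # global pooling over the volume
            )
            self.classifier = nn.Linear(32, num_emotions)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.features(x).flatten(1)            # (batch, 32)
            return self.classifier(h)                  # (batch, num_emotions)

    if __name__ == "__main__":
        model = SpectroTemporal3DCNN()
        dummy = torch.randn(2, 1, 8, 64, 32)   # (batch, ch, segments, mel bins, frames)
        print(model(dummy).shape)              # torch.Size([2, 4])
    ```

    Features pooled before the final linear layer in a sketch like this could also be extracted and projected with t-SNE to visualise emotion clusters, as the abstract describes.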
    Original language: English
    Number of pages: 6
    Publication status: Published - 2017
    Event: 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017 - San Antonio, United States
    Duration: 23 Oct 2017 - 26 Oct 2017
    Conference number: 7
    http://acii2017.org/

    Conference

    Conference: 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017
    Abbreviated title: ACII 2017
    Country/Territory: United States
    City: San Antonio
    Period: 23/10/17 - 26/10/17
    Internet address: http://acii2017.org/
