Learning spectro-temporal features with 3D CNNs for speech emotion recognition

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

31 Citations (Scopus)
12 Downloads (Pure)

Abstract

In this paper, we propose to use deep 3-dimensional convolutional networks (3D CNNs) to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of a Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM), our proposed 3D CNNs simultaneously extract short-term and long-term spectral features with a moderate number of parameters. We evaluated our proposed method and other state-of-the-art methods in a speaker-independent manner using aggregated corpora that give a large and diverse set of speakers. We found that 1) shallow temporal and moderately deep spectral kernels of a homogeneous architecture are optimal for the task; and 2) our 3D CNNs are more effective for spectro-temporal feature learning than the other methods. Finally, we visualised the feature space obtained with our proposed method using t-distributed stochastic neighbour embedding (t-SNE) and could observe distinct clusters of emotions.
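
As a rough illustration of what a 3D convolution over spectro-temporal input can look like, the sketch below stacks overlapping spectrogram segments into a volume and slides a single 3D kernel over it. It is not the architecture from the paper: the segment layout, the helper names `stack_segments` and `conv3d_valid`, and all sizes (40 frequency bins, 100 frames, a 2 x 5 x 3 kernel) are assumptions chosen only so that one kernel spans neighbouring segments, frequency bins, and frames at once.

```python
import numpy as np


def stack_segments(spectrogram, seg_len, hop):
    """Cut a (freq, time) spectrogram into overlapping segments.

    Returns an array of shape (n_segments, freq, seg_len).
    The segment layout is an assumption made for this sketch.
    """
    n_time = spectrogram.shape[1]
    starts = range(0, n_time - seg_len + 1, hop)
    return np.stack([spectrogram[:, s:s + seg_len] for s in starts])


def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation of one volume with one kernel."""
    # windows has shape (out_d, out_h, out_w, kd, kh, kw)
    windows = np.lib.stride_tricks.sliding_window_view(volume, kernel.shape)
    return np.einsum("ijkabc,abc->ijk", windows, kernel)


rng = np.random.default_rng(0)

# Hypothetical input: 40 frequency bins, 100 time frames of random values.
spec = rng.standard_normal((40, 100))

# Segments of 20 frames with a hop of 10 frames give 9 overlapping segments.
volume = stack_segments(spec, seg_len=20, hop=10)

# Hypothetical kernel: 2 segments x 5 frequency bins x 3 frames.
kernel = rng.standard_normal((2, 5, 3))

features = conv3d_valid(volume, kernel)
print(volume.shape, features.shape)
```

In a trained network the kernel weights would be learned and many kernels would be stacked across layers; here a single random kernel is used only to show how the output shape follows from the input volume and kernel extents.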

Original language: English
Title of host publication: 2017 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017
Publisher: IEEE
Pages: 383-388
Number of pages: 6
ISBN (Electronic): 978-1-5386-0563-9
Publication status: Published - 1 Feb 2018
Event: 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017 - San Antonio, United States
Duration: 23 Oct 2017 → 26 Oct 2017
Conference number: 7
http://acii2017.org/

Conference

Conference: 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017
Abbreviated title: ACII 2017
Country/Territory: United States
City: San Antonio
Period: 23/10/17 → 26/10/17

Keywords

  • 2024 OA procedure
