Abstract
In this paper, we propose deep 3-dimensional convolutional networks (3D CNNs) to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared with a hybrid of a convolutional neural network and long short-term memory (CNN-LSTM), the proposed 3D CNNs extract short-term and long-term spectral features simultaneously, with a moderate number of parameters. We evaluated the proposed method and other state-of-the-art methods in a speaker-independent manner on aggregated corpora that provide a large and diverse set of speakers. We found that 1) shallow temporal and moderately deep spectral kernels in a homogeneous architecture are optimal for the task; and 2) our 3D CNNs learn spectro-temporal features more effectively than the other methods. Finally, we visualised the feature space obtained with the proposed method using t-distributed stochastic neighbour embedding (t-SNE) and observed distinct clusters of emotions.
| Original language | English |
|---|---|
| Number of pages | 6 |
| Publication status | Published - 2017 |
| Event | 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017, San Antonio, United States. Duration: 23 Oct 2017 → 26 Oct 2017. Conference number: 7. http://acii2017.org/ |
Conference
| Conference | 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017 |
|---|---|
| Abbreviated title | ACII 2017 |
| Country/Territory | United States |
| City | San Antonio |
| Period | 23/10/17 → 26/10/17 |
| Internet address | http://acii2017.org/ |
Fingerprint
Dive into the research topics of 'Learning spectral-temporal features with 3D CNNs for speech emotion recognition'. Together they form a unique fingerprint.