Abstract
In this paper, we propose using deep 3-dimensional convolutional neural networks (3D CNNs) to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of a convolutional neural network and long short-term memory (CNN-LSTM), our proposed 3D CNNs simultaneously extract short-term and long-term spectral features with a moderate number of parameters. We evaluated our proposed method and other state-of-the-art methods in a speaker-independent manner, using aggregated corpora that provide a large and diverse set of speakers. We found that 1) shallow temporal and moderately deep spectral kernels of a homogeneous architecture are optimal for the task; and 2) our 3D CNNs are more effective for spectro-temporal feature learning than the other methods. Finally, we visualised the feature space obtained with our proposed method using t-distributed stochastic neighbour embedding (t-SNE) and observed distinct clusters of emotions.
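To make the trade-off between kernel extent and parameter count concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical 3D input laid out as (segments, frames per segment, frequency bins) and reads "shallow temporal, moderately deep spectral" as a kernel that is small along the two time axes and larger along the frequency axis. All sizes, the axis layout, and the function names are illustrative assumptions.

```python
from typing import Tuple

Shape3D = Tuple[int, int, int]


def conv3d_output_shape(input_shape: Shape3D,
                        kernel: Shape3D,
                        stride: Shape3D = (1, 1, 1)) -> Shape3D:
    """Output size of an unpadded ('valid') 3D convolution along each axis."""
    dims = []
    for n, k, s in zip(input_shape, kernel, stride):
        if k > n:
            raise ValueError("kernel is larger than the input along an axis")
        dims.append((n - k) // s + 1)
    return (dims[0], dims[1], dims[2])


def conv3d_param_count(in_channels: int,
                       out_channels: int,
                       kernel: Shape3D) -> int:
    """Trainable parameters of one 3D conv layer: weights plus one bias per filter."""
    k0, k1, k2 = kernel
    return out_channels * (in_channels * k0 * k1 * k2 + 1)


# Hypothetical input: 10 segments x 16 frames x 256 frequency bins (assumed sizes).
input_shape = (10, 16, 256)
# Assumed kernel: extent 2 along both time axes, extent 8 along frequency.
kernel = (2, 2, 8)

print(conv3d_output_shape(input_shape, kernel))   # (9, 15, 249)
print(conv3d_param_count(1, 32, kernel))          # 1056
```

Because the kernel spans adjacent segments as well as adjacent frames, a single filter covers both short-term and longer-term context, while the parameter count grows only with the product of the three kernel extents.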
Original language | English |
---|---|
Number of pages | 6 |
Publication status | Published - 2017 |
Event | 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017, San Antonio, United States. Duration: 23 Oct 2017 → 26 Oct 2017. Conference number: 7. http://acii2017.org/ |
Conference
Conference | 7th International Conference on Affective Computing and Intelligent Interaction, ACII 2017 |
---|---|
Abbreviated title | ACII 2017 |
Country/Territory | United States |
City | San Antonio |
Period | 23/10/17 → 26/10/17 |
Internet address | http://acii2017.org/ |