Abstract
Deep architectures using identity skip-connections have demonstrated groundbreaking performance in the field of image classification. Recently, empirical studies suggested that identity skip-connections enable ensemble-like behaviour of shallow networks, and that depth is not the sole ingredient for their success. Therefore, we examine the potential of identity skip-connections for the task of Speech Emotion Recognition (SER), where moderately deep temporal architectures are often employed. To this end, we propose a novel architecture which regulates unimpeded feature flows and captures long-term dependencies via gate-based skip-connections and a memory mechanism. Our proposed architecture is compared to other state-of-the-art methods of SER and is evaluated on large aggregated corpora recorded in different contexts. It outperforms the state-of-the-art methods by 9–15% and achieves an Unweighted Accuracy of 80.5% under an imbalanced class distribution. In addition, we examine a variant adopting the simplified skip-connections of Residual Networks (ResNet) and show that gate-based skip-connections are more effective than simplified skip-connections.
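The two kinds of skip-connection contrasted in the abstract, and the Unweighted Accuracy metric, can be illustrated with a minimal sketch. The abstract does not give the gate formulation, so the code below assumes a highway-style gate, y = g · F(x) + (1 − g) · x with g = sigmoid(z); the function names `residual_skip`, `gated_skip` and `unweighted_accuracy` are hypothetical and are not taken from the paper. Unweighted Accuracy is computed as the mean of per-class recalls, which is the usual definition in SER.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def residual_skip(x, fx):
    """ResNet-style simplified skip-connection: y = F(x) + x."""
    return [f + xi for f, xi in zip(fx, x)]


def gated_skip(x, fx, gate_logits):
    """Assumed highway-style gated skip: y = g * F(x) + (1 - g) * x.

    g = sigmoid(gate_logits) decides, per feature, how much of the
    transformed signal F(x) versus the untouched input x passes through.
    """
    out = []
    for xi, fi, z in zip(x, fx, gate_logits):
        g = sigmoid(z)
        out.append(g * fi + (1.0 - g) * xi)
    return out


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


x, fx = [1.0, 2.0], [3.0, 4.0]
print(residual_skip(x, fx))            # [4.0, 6.0]
print(gated_skip(x, fx, [0.0, 0.0]))   # [2.0, 3.0] (gate half open)

# Imbalanced toy labels: always predicting the majority class gives
# 80% plain accuracy but only 0.5 Unweighted Accuracy.
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10
print(unweighted_accuracy(y_true, y_pred))  # 0.5
```

With strongly negative gate logits the gated variant passes its input through almost unchanged, which is one way to read the "unimpeded feature flows" the abstract mentions; the residual variant always adds the two paths with fixed weight.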
Original language | English |
---|---|
Title of host publication | MM '17 |
Subtitle of host publication | Proceedings of the 2017 ACM on Multimedia Conference |
Publisher | Association for Computing Machinery |
Pages | 1006-1013 |
Number of pages | 8 |
ISBN (Electronic) | 978-1-4503-4906-2 |
Publication status | Published - 2017 |
Event | 25th ACM Multimedia Conference, MM 2017, Mountain View, United States. Duration: 23 Oct 2017 → 27 Oct 2017. Conference number: 25. http://www.acmmm.org/2017/ |
Conference
Conference | 25th ACM Multimedia Conference, MM 2017 |
---|---|
Abbreviated title | MM |
Country/Territory | United States |
City | Mountain View |
Period | 23/10/17 → 27/10/17 |
Keywords
- speech emotion detection
- deep learning
- recurrent neural nets