Multistage Data Selection-based Unsupervised Speaker Adaptation for Personalized Speech Emotion Recognition

Jaebok Kim, Jeong-Sik Park

    Research output: Contribution to journal › Article › Academic › peer-review

    19 Citations (Scopus)
    12 Downloads (Pure)


    This paper proposes an efficient speech emotion recognition (SER) approach that utilizes personal voice data accumulated on personal devices. A representative weakness of conventional SER systems is user-dependent performance, induced by the speaker-independent (SI) acoustic model framework. However, handheld communication devices such as smartphones accumulate collections of individual voice data, providing suitable conditions for personalized SER that surpasses the SI model framework. Taking advantage of personal devices, we propose an efficient personalized SER scheme employing maximum likelihood linear regression (MLLR), a representative speaker adaptation technique. To further adapt the conventional MLLR technique to SER tasks, the proposed approach selects data that convey emotionally discriminative acoustic characteristics and uses only those data for adaptation. For reliable data selection, we perform multistage selection using a log-likelihood distance-based measure and a universal background model. In SER experiments on a Linguistic Data Consortium emotional speech corpus, our approach outperformed conventional adaptation techniques as well as the SI model framework.
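    The data-selection idea summarized above (keep only utterances whose acoustic evidence clearly favors the target speaker model over a universal background model, judged by a log-likelihood distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the GMM-based UBM, the margin threshold, and the random stand-in features are all assumptions for illustration only.

    ```python
    # Hypothetical sketch of log-likelihood distance-based data selection.
    # A universal background model (UBM) is approximated by a small GMM trained
    # on pooled background features; an utterance is kept for adaptation when
    # its mean per-frame log-likelihood under the speaker model exceeds that
    # under the UBM by a margin. Random vectors stand in for real MFCC frames.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)

    # Stand-in acoustic features (frames x dims) for background and one speaker.
    background_frames = rng.normal(0.0, 1.0, size=(500, 13))
    speaker_frames = rng.normal(0.5, 1.0, size=(300, 13))

    # Train the UBM and a speaker-specific model (tiny GMMs for illustration).
    ubm = GaussianMixture(n_components=4, random_state=0).fit(background_frames)
    spk = GaussianMixture(n_components=4, random_state=0).fit(speaker_frames)

    def select_for_adaptation(utterances, margin=0.0):
        """Keep utterances whose speaker-model log-likelihood beats the UBM
        by at least `margin` (a log-likelihood distance criterion)."""
        selected = []
        for utt in utterances:
            # score() returns the mean per-frame log-likelihood of the frames.
            distance = spk.score(utt) - ubm.score(utt)
            if distance > margin:
                selected.append(utt)
        return selected

    # Simulated utterances: arrays of frames drawn near the speaker distribution.
    utterances = [rng.normal(0.5, 1.0, size=(50, 13)) for _ in range(5)]
    kept = select_for_adaptation(utterances)
    print(f"kept {len(kept)} of {len(utterances)} utterances")
    ```

    In the paper this selection is multistage; the sketch shows only a single threshold pass, with the selected subset then available for MLLR adaptation.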
    Original language: English
    Pages (from-to): 126-134
    Number of pages: 9
    Journal: Engineering Applications of Artificial Intelligence
    Publication status: Published - Jun 2016


    • HMI-SLT: Speech and Language Technology
    • speaker adaptation
    • speech emotion detection
    • IR-102933
    • EC Grant Agreement nr.: FP7/611153
    • Hidden-Markov-Model
    • METIS-320902
    • EWI-27463
    • n/a OA procedure


