Discrimination Between Native and Non-Native Speech Using Visual Features Only

Christos Georgakis, Stavros Petridis, Maja Pantic

Abstract

Accent is a soft biometric trait that can be inferred from the pronunciation and articulation patterns characterizing an individual's speaking style. Past research has addressed the task of classifying accent as belonging to a native or a foreign language speaker using the audio modality only. However, features extracted from the visual stream of speech have been successfully used to extend or substitute audio-only approaches targeting speech or language recognition. Motivated by these findings, we investigate to what extent temporal visual speech dynamics attributed to accent can be modeled and identified when the audio stream is missing or noisy and the speech content is unknown. We present a fully automated approach to discriminating native from non-native English speech based exclusively on visual cues. A systematic evaluation of various appearance and shape features for the target problem is conducted, with the former consistently yielding superior performance. Subject-independent cross-validation experiments are conducted on mobile phone recordings of continuous speech and isolated word utterances spoken by 56 subjects from the challenging MOBIO database. High performance is achieved on a text-dependent (TD) protocol, with the best score of 76.5% yielded by fusion of five hidden Markov models trained on appearance features. Our framework also remains effective when tested on speech content unseen during training, although it performs less accurately than in the TD case.
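The fusion step mentioned in the abstract (combining the decisions of several HMMs trained on different appearance features) can be sketched as score-level fusion: each model emits a per-class log-likelihood, the scores are summed across models, and the higher-scoring class wins. This is purely an illustration of the general technique, not the authors' actual implementation; the function and variable names below are hypothetical.

```python
# Illustrative sketch (not the paper's code): score-level fusion of several
# classifiers, e.g. HMMs trained on different appearance features. Each model
# reports a log-likelihood per class; the fused decision sums these across
# models and picks the class with the highest total.

CLASSES = ("native", "non-native")

def fuse_and_classify(per_model_scores):
    """per_model_scores: list of dicts mapping class name -> log-likelihood."""
    fused = {c: sum(scores[c] for scores in per_model_scores) for c in CLASSES}
    return max(fused, key=fused.get)

# Toy log-likelihoods from three hypothetical models: two favour "non-native".
model_scores = [
    {"native": -10.2, "non-native": -8.7},
    {"native": -9.9,  "non-native": -11.3},
    {"native": -12.0, "non-native": -9.1},
]
print(fuse_and_classify(model_scores))  # -> non-native
```

Summing log-likelihoods corresponds to treating the models' scores as independent evidence; other fusion rules (majority vote, weighted sums) fit the same interface.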
Original language: Undefined
Pages (from-to): 2758-2771
Number of pages: 14
Journal: IEEE Transactions on Cybernetics
Volume: 46
Issue number: 12
DOI: 10.1109/TCYB.2015.2488592
State: Published - Dec 2016

Keywords

  • HMI-HF: Human Factors
  • EC Grant Agreement nr.: FP7/611153
  • visual accent classification
  • EC Grant Agreement nr.: FP7/2007-2013
  • METIS-315564
  • Non-Native Speech
  • Visual Speech Processing
  • IR-99334
  • Foreign Accent Detection
  • EWI-26750

Cite this

Georgakis, Christos; Petridis, Stavros; Pantic, Maja / Discrimination Between Native and Non-Native Speech Using Visual Features Only.

In: IEEE Transactions on Cybernetics, Vol. 46, No. 12, 12.2016, p. 2758-2771.

Research output: Scientific - peer-review › Article

@article{ad60f34172f24d998a1ae8d743247fb6,
title = "Discrimination Between Native and Non-Native Speech Using Visual Features Only",
abstract = "Accent is a soft biometric trait that can be inferred from pronunciation and articulation patterns characterizing the speaking style of an individual. Past research has addressed the task of classifying accent, as belonging to a native language speaker or a foreign language speaker, by means of the audio modality only. However, features extracted from the visual stream of speech have been successfully used to extend or substitute audio-only approaches that target speech or language recognition. Motivated by these findings, we investigate to what extent temporal visual speech dynamics attributed to accent can be modeled and identified when the audio stream is missing or noisy, and the speech content is unknown. We present here a fully automated approach to discriminating native from non-native English speech, based exclusively on visual cues. A systematic evaluation of various appearance and shape features for the target problem is conducted, with the former consistently yielding superior performance. Subject-independent cross-validation experiments are conducted on mobile phone recordings of continuous speech and isolated word utterances spoken by 56 subjects from the challenging MOBIO database. High performance is achieved on a text-dependent (TD) protocol, with the best score of 76.5% yielded by fusion of five hidden Markov models trained on appearance features. Our framework is also efficient even when tested on examples of speech unseen in the training phase, although performing less accurately compared to the TD case.",
keywords = "HMI-HF: Human Factors, EC Grant Agreement nr.: FP7/611153, visual accent classification, EC Grant Agreement nr.: FP7/2007-2013, METIS-315564, Non-Native Speech, Visual Speech Processing, IR-99334, Foreign Accent Detection, EWI-26750",
author = "Christos Georgakis and Stavros Petridis and Maja Pantic",
note = "eemcs-eprint-26750",
year = "2016",
month = "12",
doi = "10.1109/TCYB.2015.2488592",
volume = "46",
pages = "2758--2771",
journal = "IEEE Transactions on Cybernetics",
issn = "2168-2267",
publisher = "IEEE Advancing Technology for Humanity",
number = "12",
}


TY - JOUR

T1 - Discrimination Between Native and Non-Native Speech Using Visual Features Only

AU - Georgakis,Christos

AU - Petridis,Stavros

AU - Pantic,Maja

N1 - eemcs-eprint-26750

PY - 2016/12

Y1 - 2016/12

N2 - Accent is a soft biometric trait that can be inferred from pronunciation and articulation patterns characterizing the speaking style of an individual. Past research has addressed the task of classifying accent, as belonging to a native language speaker or a foreign language speaker, by means of the audio modality only. However, features extracted from the visual stream of speech have been successfully used to extend or substitute audio-only approaches that target speech or language recognition. Motivated by these findings, we investigate to what extent temporal visual speech dynamics attributed to accent can be modeled and identified when the audio stream is missing or noisy, and the speech content is unknown. We present here a fully automated approach to discriminating native from non-native English speech, based exclusively on visual cues. A systematic evaluation of various appearance and shape features for the target problem is conducted, with the former consistently yielding superior performance. Subject-independent cross-validation experiments are conducted on mobile phone recordings of continuous speech and isolated word utterances spoken by 56 subjects from the challenging MOBIO database. High performance is achieved on a text-dependent (TD) protocol, with the best score of 76.5% yielded by fusion of five hidden Markov models trained on appearance features. Our framework is also efficient even when tested on examples of speech unseen in the training phase, although performing less accurately compared to the TD case.

AB - Accent is a soft biometric trait that can be inferred from pronunciation and articulation patterns characterizing the speaking style of an individual. Past research has addressed the task of classifying accent, as belonging to a native language speaker or a foreign language speaker, by means of the audio modality only. However, features extracted from the visual stream of speech have been successfully used to extend or substitute audio-only approaches that target speech or language recognition. Motivated by these findings, we investigate to what extent temporal visual speech dynamics attributed to accent can be modeled and identified when the audio stream is missing or noisy, and the speech content is unknown. We present here a fully automated approach to discriminating native from non-native English speech, based exclusively on visual cues. A systematic evaluation of various appearance and shape features for the target problem is conducted, with the former consistently yielding superior performance. Subject-independent cross-validation experiments are conducted on mobile phone recordings of continuous speech and isolated word utterances spoken by 56 subjects from the challenging MOBIO database. High performance is achieved on a text-dependent (TD) protocol, with the best score of 76.5% yielded by fusion of five hidden Markov models trained on appearance features. Our framework is also efficient even when tested on examples of speech unseen in the training phase, although performing less accurately compared to the TD case.

KW - HMI-HF: Human Factors

KW - EC Grant Agreement nr.: FP7/611153

KW - visual accent classification

KW - EC Grant Agreement nr.: FP7/2007-2013

KW - METIS-315564

KW - Non-Native Speech

KW - Visual Speech Processing

KW - IR-99334

KW - Foreign Accent Detection

KW - EWI-26750

U2 - 10.1109/TCYB.2015.2488592

DO - 10.1109/TCYB.2015.2488592

M3 - Article

VL - 46

SP - 2758

EP - 2771

JO - IEEE Transactions on Cybernetics

T2 - IEEE Transactions on Cybernetics

JF - IEEE Transactions on Cybernetics

SN - 2168-2267

IS - 12

ER -