Prediction-based Audiovisual Fusion for Classification of Non-Linguistic Vocalisations

Stavros Petridis, Maja Pantic

  • 4 Citations

Abstract

Prediction plays a key role in recent computational models of the brain, and it has been suggested that the brain constantly makes multisensory spatiotemporal predictions. Inspired by these findings, we tackle the problem of audiovisual fusion from a new, prediction-based perspective. We train predictive models that capture the spatiotemporal relationship between audio and visual features by learning the audio-to-visual and visual-to-audio feature mappings for each class. Similarly, we train predictive models that capture the temporal evolution of the audio and visual features by learning the past-to-future feature mapping for each class. During classification, all class-specific regression models produce a prediction of the expected audio/visual features, and their prediction errors are combined for each class. The set of class-specific regressors that best describes the audiovisual feature relationship, i.e., that yields the lowest prediction error, determines the label of the input frame. We perform cross-database experiments on the AMI, SAL and MAHNOB databases to classify laughter and speech, and subject-independent experiments on the AVIC database to classify laughter, hesitation and consent. In virtually all cases, prediction-based audiovisual fusion outperforms the two most commonly used fusion approaches, decision-level and feature-level fusion.
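The classification scheme summarised above can be sketched in a few lines of code. The Python snippet below is a minimal, illustrative sketch only, not the authors' implementation: it uses scikit-learn linear regressors as stand-ins for the class-specific predictive models, combines the audio-to-visual and visual-to-audio prediction errors by an unweighted sum, and omits the past-to-future temporal predictors, which would be trained and scored analogously. The class name PredictionBasedFusion, the feature shapes and the dictionary layout are all hypothetical.

# Minimal sketch of prediction-based audiovisual fusion (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression

class PredictionBasedFusion:
    """Per-class audio<->visual regressors; lowest combined prediction error wins."""

    def __init__(self, classes):
        self.classes = list(classes)
        self.models = {}  # (class_label, direction) -> fitted regressor

    def fit(self, data):
        # data: {class_label: (audio [N, Da], visual [N, Dv])} per-frame features
        for label, (audio, visual) in data.items():
            self.models[(label, "a2v")] = LinearRegression().fit(audio, visual)
            self.models[(label, "v2a")] = LinearRegression().fit(visual, audio)

    def predict(self, audio, visual):
        # For every frame, sum the audio-to-visual and visual-to-audio errors
        # of each class's regressors and pick the class with the lowest error.
        per_class_errors = []
        for label in self.classes:
            v_hat = self.models[(label, "a2v")].predict(audio)
            a_hat = self.models[(label, "v2a")].predict(visual)
            err = (np.mean((v_hat - visual) ** 2, axis=1)
                   + np.mean((a_hat - audio) ** 2, axis=1))
            per_class_errors.append(err)
        winners = np.argmin(np.stack(per_class_errors, axis=1), axis=1)
        return [self.classes[i] for i in winners]

Under the same assumptions, one would fit on a dictionary such as {"laughter": (A_laugh, V_laugh), "speech": (A_speech, V_speech)} and call predict(A_test, V_test) to obtain one label per frame; the paper's temporal (past-to-future) predictors would simply contribute additional error terms to the same per-class sum.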
Original language: Undefined
Pages (from-to): 45-58
Number of pages: 14
Journal: IEEE Transactions on Affective Computing
Volume: 7
Issue number: 1
DOIs: 10.1109/TAFFC.2015.2446462
State: Published - Feb 2016

Keywords

  • EWI-26753
  • HMI-HF: Human Factors
  • EC Grant Agreement nr.: FP7/611153
  • IR-99374
  • Prediction-based Fusion
  • Audio-visual Fusion
  • METIS-315566
  • Nonlinguistic Vocalisation Classification

Cite this

Petridis, Stavros; Pantic, Maja / Prediction-based Audiovisual Fusion for Classification of Non-Linguistic Vocalisations.

In: IEEE Transactions on Affective Computing, Vol. 7, No. 1, 02.2016, p. 45-58.

Research output: Scientific - peer-review › Article

@article{4616da9c27824fbbaba5f4c2b168e91f,
title = "Prediction-based Audiovisual Fusion for Classification of Non-Linguistic Vocalisations",
abstract = "Prediction plays a key role in recent computational models of the brain and it has been suggested that the brain constantly makes multisensory spatiotemporal predictions. Inspired by these findings we tackle the problem of audiovisual fusion from a new perspective based on prediction. We train predictive models which model the spatiotemporal relationship between audio and visual features by learning the audio-to-visual and visual-to-audio feature mapping for each class. Similarly, we train predictive models which model the time evolution of audio and visual features by learning the past-to-future feature mapping for each class. In classification, all the class-specific regression models produce a prediction of the expected audio / visual features and their prediction errors are combined for each class. The set of class-specific regressors which best describes the audiovisual feature relationship, i.e., results in the lowest prediction error, is chosen to label the input frame. We perform cross-database experiments, using the AMI, SAL and MAHNOB databases, in order to classify laughter and speech and subject-independent experiments on the AVIC database in order to classify laughter, hesitation and consent. In virtually all cases prediction-based audiovisual fusion consistently outperforms the two most commonly used fusion approaches, decision-level and feature-level fusion.",
keywords = "EWI-26753, HMI-HF: Human Factors, EC Grant Agreement nr.: FP7/611153, IR-99374, Prediction-based Fusion, Audio-visual Fusion, METIS-315566, Nonlinguistic Vocalisation Classification",
author = "Stavros Petridis and Maja Pantic",
note = "eemcs-eprint-26753 ; http://eprints.ewi.utwente.nl/26753",
year = "2016",
month = "2",
doi = "10.1109/TAFFC.2015.2446462",
volume = "7",
pages = "45--58",
journal = "IEEE transactions on affective computing",
issn = "1949-3045",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "1",

}

TY - JOUR

T1 - Prediction-based Audiovisual Fusion for Classification of Non-Linguistic Vocalisations

AU - Petridis, Stavros

AU - Pantic, Maja

N1 - eemcs-eprint-26753 ; http://eprints.ewi.utwente.nl/26753

PY - 2016/2

Y1 - 2016/2

N2 - Prediction plays a key role in recent computational models of the brain and it has been suggested that the brain constantly makes multisensory spatiotemporal predictions. Inspired by these findings we tackle the problem of audiovisual fusion from a new perspective based on prediction. We train predictive models which model the spatiotemporal relationship between audio and visual features by learning the audio-to-visual and visual-to-audio feature mapping for each class. Similarly, we train predictive models which model the time evolution of audio and visual features by learning the past-to-future feature mapping for each class. In classification, all the class-specific regression models produce a prediction of the expected audio / visual features and their prediction errors are combined for each class. The set of class-specific regressors which best describes the audiovisual feature relationship, i.e., results in the lowest prediction error, is chosen to label the input frame. We perform cross-database experiments, using the AMI, SAL and MAHNOB databases, in order to classify laughter and speech and subject-independent experiments on the AVIC database in order to classify laughter, hesitation and consent. In virtually all cases prediction-based audiovisual fusion consistently outperforms the two most commonly used fusion approaches, decision-level and feature-level fusion.

AB - Prediction plays a key role in recent computational models of the brain and it has been suggested that the brain constantly makes multisensory spatiotemporal predictions. Inspired by these findings we tackle the problem of audiovisual fusion from a new perspective based on prediction. We train predictive models which model the spatiotemporal relationship between audio and visual features by learning the audio-to-visual and visual-to-audio feature mapping for each class. Similarly, we train predictive models which model the time evolution of audio and visual features by learning the past-to-future feature mapping for each class. In classification, all the class-specific regression models produce a prediction of the expected audio / visual features and their prediction errors are combined for each class. The set of class-specific regressors which best describes the audiovisual feature relationship, i.e., results in the lowest prediction error, is chosen to label the input frame. We perform cross-database experiments, using the AMI, SAL and MAHNOB databases, in order to classify laughter and speech and subject-independent experiments on the AVIC database in order to classify laughter, hesitation and consent. In virtually all cases prediction-based audiovisual fusion consistently outperforms the two most commonly used fusion approaches, decision-level and feature-level fusion.

KW - EWI-26753

KW - HMI-HF: Human Factors

KW - EC Grant Agreement nr.: FP7/611153

KW - IR-99374

KW - Prediction-based Fusion

KW - Audio-visual Fusion

KW - METIS-315566

KW - Nonlinguistic Vocalisation Classification

U2 - 10.1109/TAFFC.2015.2446462

DO - 10.1109/TAFFC.2015.2446462

M3 - Article

VL - 7

SP - 45

EP - 58

JO - IEEE transactions on affective computing

T2 - IEEE transactions on affective computing

JF - IEEE transactions on affective computing

SN - 1949-3045

IS - 1

ER -