Audiovisual vocal outburst classification in noisy conditions

Florian Eyben, Stavros Petridis, Björn Schuller, Maja Pantic

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Academic; peer-review

Abstract

In this study, we investigate an audiovisual approach to the classification of vocal outbursts (non-linguistic vocalisations) in noisy conditions using Long Short-Term Memory (LSTM) recurrent neural networks and Support Vector Machines. Geometric shape features and acoustic low-level descriptors are fused at the feature level. Three types of acoustic noise are considered: babble, office, and street noise. Experiments are conducted on every noise type to assess the benefit of the fusion in each case. The evaluation database is the Audiovisual Interest Corpus of natural human-to-human conversation from the INTERSPEECH 2010 Paralinguistic Challenge. The results show that, even when training is performed on noise-corrupted audio that matches the test conditions, the addition of visual features is still beneficial.
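
As an illustration of what fusion at the feature level usually means, the minimal Python sketch below concatenates per-frame acoustic and visual feature vectors before any classifier sees them. The function name `fuse_features`, the toy values, and the assumption that both streams are already aligned frame by frame are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Frame = List[float]


def fuse_features(audio: Sequence[Sequence[float]],
                  visual: Sequence[Sequence[float]]) -> List[Frame]:
    """Feature-level (early) fusion: concatenate per-frame vectors.

    Assumes both streams were already resampled to the same frame rate;
    that alignment step is an assumption of this sketch.
    """
    if len(audio) != len(visual):
        raise ValueError("audio and visual streams must have the same number of frames")
    # Each fused frame is the acoustic descriptors followed by the shape features.
    return [list(a) + list(v) for a, v in zip(audio, visual)]


# Hypothetical toy data: 3 frames, 2 acoustic descriptors, 3 shape features.
audio_frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
visual_frames = [[1.0, 1.1, 1.2], [1.3, 1.4, 1.5], [1.6, 1.7, 1.8]]

fused = fuse_features(audio_frames, visual_frames)
print(len(fused), len(fused[0]))  # 3 5
```

The fused frames could then be passed to a single sequence model or classifier, in contrast to decision-level fusion, where each modality is classified separately and the outputs are combined afterwards.
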
Original language: Undefined
Title of host publication: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2012
Place of Publication: USA
Publisher: IEEE Computer Society
Pages: 5097-5100
Number of pages: 4
ISBN (Print): 978-1-4673-0045-2
DOIs: https://doi.org/10.1109/ICASSP.2012.6289067
Publication status: Published - 25 Mar 2012
Event: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2012 - Kyoto, Japan
Duration: 25 Mar 2012 to 30 Mar 2012

Publication series

Name
Publisher: IEEE Computer Society
ISSN (Print): 1520-6149

Conference

Conference: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2012
Abbreviated title: ICASSP
Country: Japan
City: Kyoto
Period: 25/03/12 to 30/03/12

Keywords

  • EWI-23055
  • METIS-296292
  • IR-84320
  • HMI-MI: MULTIMODAL INTERACTIONS

Cite this

Eyben, F., Petridis, S., Schuller, B., & Pantic, M. (2012). Audiovisual vocal outburst classification in noisy conditions. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2012 (pp. 5097-5100). USA: IEEE Computer Society. https://doi.org/10.1109/ICASSP.2012.6289067