Robust Speech/Non-Speech Classification in Heterogeneous Multimedia Content

M.A.H. Huijbregts, Franciska M.G. de Jong

    Research output: Contribution to journal › Article › Academic › peer-review

    17 Citations (Scopus)

    Abstract

    In this paper we present a speech/non-speech classification method that delivers high-quality classification without knowing in advance what kinds of audible non-speech events are present in an audio recording, and without requiring a single parameter to be tuned on in-domain data. Because no parameter tuning is needed and no training data is required for models of specific sounds, the classifier can process a wide range of audio types under varying conditions, and it thereby contributes to a more robust automatic speech recognition framework. Our speech/non-speech classification system does not attempt to classify all audible non-speech in a single run. Instead, a bootstrap speech/silence classification is first obtained with a standard speech/non-speech classifier. Next, models for speech, silence and audible non-speech are trained on the target audio using the bootstrap classification. Experiments show that the proposed system performs 83% and 44% better (relative) than a common broadcast news speech/non-speech classifier on a collection of meetings recorded with table-top microphones and on a collection of Dutch television broadcasts used for TRECVID 2007, respectively.
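    The two-pass idea described in the abstract can be illustrated in code. The sketch below is a hypothetical reconstruction, not the authors' implementation: it assumes per-frame features such as MFCCs, uses an energy percentile as a stand-in for the bootstrap speech/non-speech classifier, and models each class with a scikit-learn GaussianMixture; all thresholds and model sizes are illustrative.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def bootstrap_speech_silence(log_energy, percentile=30):
        """Pass 1: stand-in bootstrap classifier -- frames below an energy
        percentile are labelled silence, the rest speech-like."""
        return log_energy > np.percentile(log_energy, percentile)

    def classify_frames(features, log_energy, n_components=4):
        """Pass 2: train GMMs for speech, silence and audible non-speech on
        the target audio itself, seeded by the bootstrap labels, then
        re-label every frame by maximum likelihood.
        Returns per frame: 0 = silence, 1 = speech, 2 = audible non-speech."""
        speech = bootstrap_speech_silence(log_energy)
        speech_gmm = GaussianMixture(n_components=n_components).fit(features[speech])
        silence_gmm = GaussianMixture(n_components=n_components).fit(features[~speech])

        # Illustrative heuristic: the frames the speech model explains worst
        # are used as seeds for the audible non-speech model.
        ll = speech_gmm.score_samples(features)
        seeds = speech & (ll < np.percentile(ll[speech], 10))
        nonspeech_gmm = GaussianMixture(n_components=n_components).fit(features[seeds])

        scores = np.stack([silence_gmm.score_samples(features),
                           speech_gmm.score_samples(features),
                           nonspeech_gmm.score_samples(features)])
        return scores.argmax(axis=0)

    Here features would be an (n_frames, n_dims) feature matrix such as MFCCs and log_energy a per-frame energy track; the feature choice, the GMM sizes and the seeding heuristic are all assumptions made for the sake of the sketch.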
    Original language: Undefined
    Pages (from-to): 143-153
    Number of pages: 11
    Journal: Speech communication
    Volume: 53
    Issue number: 2
    DOI: 10.1016/j.specom.2010.08.008
    Publication status: Published - Feb 2011

    Keywords

    • EWI-18833
    • SHoUT toolkit
    • Speech/non-speech classification
    • rich transcription
    • IR-75066
    • EC Grant Agreement nr.: FP6/506811
    • EC Grant Agreement nr.: FP6/027413
    • METIS-277450
    • EC Grant Agreement nr.: FP6/027685

    Cite this

    @article{ea2e2f94d4ac4c0eacb8f61f3cb63be9,
        title     = "Robust Speech/Non-Speech Classification in Heterogeneous Multimedia Content",
        abstract  = "In this paper we present a speech/non-speech classification method that allows high quality classification without the need to know in advance what kinds of audible non-speech events are present in an audio recording and that does not require a single parameter to be tuned on in-domain data. Because no parameter tuning is needed and no training data is required to train models for specific sounds, the classifier is able to process a wide range of audio types with varying conditions and thereby contributes to the development of a more robust automatic speech recognition framework. Our speech/non-speech classification system does not attempt to classify all audible non-speech in a single run. Instead, first a bootstrap speech/silence classification is obtained using a standard speech/non-speech classifier. Next, models for speech, silence and audible non-speech are trained on the target audio using the bootstrap classification. The experiments show that the performance of the proposed system is 83{\%} and 44{\%} (relative) better than that of a common broadcast news speech/non-speech classifier when applied to a collection of meetings recorded with table-top microphones and a collection of Dutch television broadcasts used for TRECVID 2007.",
        keywords  = "EWI-18833, SHoUT toolkit, Speech/non-speech classification, rich transcription, IR-75066, EC Grant Agreement nr.: FP6/506811, EC Grant Agreement nr.: FP6/027413, METIS-277450, EC Grant Agreement nr.: FP6/027685",
        author    = "M.A.H. Huijbregts and {de Jong}, {Franciska M.G.}",
        year      = "2011",
        month     = feb,
        doi       = "10.1016/j.specom.2010.08.008",
        language  = "Undefined",
        volume    = "53",
        pages     = "143--153",
        journal   = "Speech communication",
        issn      = "0167-6393",
        publisher = "Elsevier",
        number    = "2",
    }
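    As a quick sanity check, the entry above can be parsed programmatically; the sketch below uses the third-party bibtexparser package (an arbitrary choice of library) on an abridged copy of the record, with the abstract and keywords omitted only to keep the example short.

    import bibtexparser  # pip install bibtexparser

    ENTRY = """
    @article{ea2e2f94d4ac4c0eacb8f61f3cb63be9,
        title   = "Robust Speech/Non-Speech Classification in Heterogeneous Multimedia Content",
        author  = "M.A.H. Huijbregts and {de Jong}, {Franciska M.G.}",
        journal = "Speech communication",
        volume  = "53",
        number  = "2",
        pages   = "143--153",
        year    = "2011",
        doi     = "10.1016/j.specom.2010.08.008"
    }
    """

    record = bibtexparser.loads(ENTRY).entries[0]
    print(record["ID"])   # ea2e2f94d4ac4c0eacb8f61f3cb63be9
    print(record["doi"])  # 10.1016/j.specom.2010.08.008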

    Robust Speech/Non-Speech Classification in Heterogeneous Multimedia Content. / Huijbregts, M.A.H.; de Jong, Franciska M.G.

    In: Speech communication, Vol. 53, No. 2, 02.2011, p. 143-153.


    TY - JOUR

    T1 - Robust Speech/Non-Speech Classification in Heterogeneous Multimedia Content

    AU - Huijbregts, M.A.H.

    AU - de Jong, Franciska M.G.

    PY - 2011/2

    Y1 - 2011/2

    AB - In this paper we present a speech/non-speech classification method that allows high quality classification without the need to know in advance what kinds of audible non-speech events are present in an audio recording and that does not require a single parameter to be tuned on in-domain data. Because no parameter tuning is needed and no training data is required to train models for specific sounds, the classifier is able to process a wide range of audio types with varying conditions and thereby contributes to the development of a more robust automatic speech recognition framework. Our speech/non-speech classification system does not attempt to classify all audible non-speech in a single run. Instead, first a bootstrap speech/silence classification is obtained using a standard speech/non-speech classifier. Next, models for speech, silence and audible non-speech are trained on the target audio using the bootstrap classification. The experiments show that the performance of the proposed system is 83% and 44% (relative) better than that of a common broadcast news speech/non-speech classifier when applied to a collection of meetings recorded with table-top microphones and a collection of Dutch television broadcasts used for TRECVID 2007.

    KW - EWI-18833

    KW - SHoUT toolkit

    KW - Speech/non-speech classification

    KW - rich transcription

    KW - IR-75066

    KW - EC Grant Agreement nr.: FP6/506811

    KW - EC Grant Agreement nr.: FP6/027413

    KW - METIS-277450

    KW - EC Grant Agreement nr.: FP6/027685

    DO - 10.1016/j.specom.2010.08.008

    M3 - Article

    VL - 53

    SP - 143

    EP - 153

    JO - Speech communication

    JF - Speech communication

    SN - 0167-6393

    IS - 2

    ER -