Deep Complementary Bottleneck Features for Visual Speech Recognition

Stavros Petridis, Maja Pantic

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > Academic


    Abstract

    Deep bottleneck features (DBNFs) have been used successfully for acoustic speech recognition. However, research on extracting DBNFs for visual speech recognition is very limited. In this work, we present an approach to extract deep bottleneck visual features based on deep autoencoders. To the best of our knowledge, this is the first work that extracts DBNFs for visual speech recognition directly from pixels. We first train a deep autoencoder with a bottleneck layer in order to reduce the dimensionality of the image. The autoencoder's decoding layers are then replaced by classification layers, which make the bottleneck features more discriminative. Discrete Cosine Transform (DCT) features are also appended to the bottleneck layer during training so that the bottleneck features become complementary to the DCT features. Long Short-Term Memory (LSTM) networks are used to model the temporal dynamics, and performance is evaluated on the OuluVS and AVLetters databases. The extracted complementary DBNFs, in combination with DCT features, achieve the best performance, resulting in an absolute improvement of up to 5% over the DCT baseline. © 2016 IEEE
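
    The data flow described in the abstract can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation. It assumes one plausible reading of "appended": the DCT features are concatenated with the bottleneck activations before the classification layers. The layer sizes, the tanh activations, the random weights, and the choice of the top-left low-frequency block as DCT features are all hypothetical. Autoencoder pre-training, back-propagation, and the LSTM temporal model are omitted.

```python
import math
import random


def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(img):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in img]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to [row][col]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (placeholder for trained values)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, bias, activation):
    """Fully connected layer: activation(W x + b)."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def init_params(n_pixels, n_hidden, n_bottleneck, n_dct, n_classes, seed=0):
    """Hypothetical sizes; the classifier input is bottleneck + n_dct*n_dct DCT values."""
    rng = random.Random(seed)
    return {
        "enc": init_layer(n_hidden, n_pixels, rng),
        "bottleneck": init_layer(n_bottleneck, n_hidden, rng),
        "cls": init_layer(n_hidden, n_bottleneck + n_dct * n_dct, rng),
        "out": init_layer(n_classes, n_hidden, rng),
    }


def forward(img, params, n_dct=3):
    """Pixels -> encoder -> bottleneck; [bottleneck, DCT] -> classification layers."""
    pixels = [p for row in img for p in row]
    hidden = dense(pixels, *params["enc"], math.tanh)
    bottleneck = dense(hidden, *params["bottleneck"], math.tanh)
    # Assumption: keep the n_dct x n_dct lowest-frequency coefficients.
    coeffs = dct_2d(img)
    dct_feats = [coeffs[u][v] for u in range(n_dct) for v in range(n_dct)]
    # The classification layers see both feature sets, so training them jointly
    # would push the bottleneck towards information the DCT features lack.
    joint = bottleneck + dct_feats
    cls_hidden = dense(joint, *params["cls"], math.tanh)
    logits = dense(cls_hidden, *params["out"], lambda z: z)
    return bottleneck, dct_feats, softmax(logits)


if __name__ == "__main__":
    rng = random.Random(1)
    mouth_roi = [[rng.random() for _ in range(8)] for _ in range(8)]  # toy 8x8 image
    params = init_params(n_pixels=64, n_hidden=16, n_bottleneck=4, n_dct=3, n_classes=10)
    bottleneck, dct_feats, probs = forward(mouth_roi, params)
    print(len(bottleneck), len(dct_feats), len(probs))
```

    In a full pipeline under the same assumptions, the per-frame vector formed by the bottleneck and DCT features would be passed frame by frame to a sequence model such as an LSTM.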
    Original language: Undefined
    Title of host publication: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016
    Place of Publication: Danvers, MA, USA
    Publisher: IEEE
    Pages: 2304-2308
    Number of pages: 5
    ISBN (Print): 978-147999988-0
    Publication status: Published - Mar 2016
    Event: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016: Signal and Information Processing: The Heartbeat of a Smart Society - Shanghai, China
    Duration: 20 Mar 2016 to 25 Mar 2016
    http://www.icassp2016.org/

    Publication series

    Name: IEEE International Conference on Acoustics, Speech and Signal Processing
    Publisher: Institute of Electrical and Electronics Engineers
    ISSN (Print): 2379-190X

    Conference

    Conference: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016
    Abbreviated title: ICASSP
    Country: China
    City: Shanghai
    Period: 20/03/16 to 25/03/16

    Keywords

    • HMI-HF: Human Factors
    • EWI-27130
    • Deep Autoencoders
    • IR-103095
    • Long-Short Term Recurrent Neural Networks
    • Visual Speech Recognition
    • METIS-320876
    • Deep Bottleneck Features
