Comparison of prediction-based fusion and feature-level fusion across different learning models

Stavros Petridis, Sanjay Bilakhia, Maja Pantic

    Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, Academic, peer-review



    Evidence from neuroscience indicates that prediction of spatial and temporal patterns in the brain plays a key role in perception. This has given rise to prediction-based fusion as a method of combining information from the audio and visual modalities. Models are trained on a per-class basis to learn the mapping from one feature space to the other. When presented with unseen data, each model predicts the respective feature sets using its learnt mapping, and the prediction errors are combined within each class. The model that best describes the audiovisual relationship (by having the lowest combined prediction error) provides its label to the input data. Previous studies have evaluated this method of combining modalities only with neural networks; this paper extends it to other learning methods, including Long Short-Term Memory recurrent neural networks (LSTMs), Support Vector Machines (SVMs), Relevance Vector Machines (RVMs), and Gaussian Processes (GPs). Our results from cross-database experiments on nonlinguistic vocalisation recognition show that prediction-based fusion significantly outperforms feature-level fusion for neural networks, LSTMs, and GPs, while results for SVMs and RVMs are more ambiguous, with neither fusion approach gaining an absolute advantage over the other.
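The classification rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a ridge-regularised linear regressor as a stand-in for the learning models named above, an unweighted sum of the two mean squared errors as the combined prediction error, and synthetic 3-dimensional audio and visual features. All function names (`fit_linear_map`, `train_per_class`, `classify`, `make_segment`) are hypothetical.

```python
import numpy as np

def fit_linear_map(X, Y, ridge=1e-3):
    """Ridge-regularised least-squares map from X to Y, with a bias term.
    Assumption: a linear stand-in for the regressors used in the paper."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def predict(W, X):
    """Apply a map learnt by fit_linear_map to new frames."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W

def train_per_class(audio, video, labels):
    """For each class, learn audio->video and video->audio mappings."""
    models = {}
    for c in np.unique(labels):
        idx = labels == c
        models[int(c)] = (
            fit_linear_map(audio[idx], video[idx]),  # audio -> video
            fit_linear_map(video[idx], audio[idx]),  # video -> audio
        )
    return models

def classify(models, audio, video):
    """Return the class whose pair of models has the lowest combined error.
    Assumption: combined error = sum of the two mean squared errors."""
    best_label, best_err = None, np.inf
    for c, (W_av, W_va) in models.items():
        err_av = np.mean((predict(W_av, audio) - video) ** 2)
        err_va = np.mean((predict(W_va, video) - audio) ** 2)
        combined = err_av + err_va
        if combined < best_err:
            best_label, best_err = c, combined
    return best_label

def make_segment(rng, sign, n_frames=50):
    """Synthetic data: video features equal sign * audio features plus noise."""
    audio = rng.normal(size=(n_frames, 3))
    video = sign * audio + 0.1 * rng.normal(size=(n_frames, 3))
    return audio, video

# Two toy classes with opposite audio-visual relationships.
rng = np.random.default_rng(0)
a0, v0 = make_segment(rng, +1.0, n_frames=200)
a1, v1 = make_segment(rng, -1.0, n_frames=200)
train_audio = np.vstack([a0, a1])
train_video = np.vstack([v0, v1])
train_labels = np.array([0] * 200 + [1] * 200)

models = train_per_class(train_audio, train_video, train_labels)

# Classify one unseen segment from each class.
test_a0, test_v0 = make_segment(rng, +1.0)
test_a1, test_v1 = make_segment(rng, -1.0)
print(classify(models, test_a0, test_v0), classify(models, test_a1, test_v1))
```

In this toy setting the two classes differ only in how the audio features relate to the visual features, so the per-class prediction errors alone are enough to separate them. Feature-level fusion, by contrast, would concatenate the audio and visual features and train a single classifier on the joint vector.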
    Original language: Undefined
    Title of host publication: Proceedings of the 20th ACM International Conference on Multimedia, MM 2012
    Place of Publication: New York
    Publisher: Association for Computing Machinery
    Number of pages: 4
    ISBN (Print): 978-1-4503-1089-5
    Publication status: Published - 29 Oct 2012
    Event: 20th ACM Multimedia Conference, MM 2012 - Nara, Japan
    Duration: 29 Oct 2012 to 2 Nov 2012
    Conference number: 20


    Conference: 20th ACM Multimedia Conference, MM 2012
    Abbreviated title: MM


    • EWI-22891
    • IR-84312
    • METIS-296222
    • Nonlinguistic Information Processing
    • Prediction-based Classification/Fusion
    • Audio-visual Fusion
