Exploiting Visual Cues in Non-Scripted Lecture Videos for Multi-modal Action Recognition

Ali Shariq Imran, Alejandro Manuel Moreno Celleri, Faouzi Alaya Cheikh

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

    4 Citations (Scopus)
    19 Downloads (Pure)


    The use of non-scripted lecture videos as part of learning material is becoming an everyday activity in most higher education institutions due to the growing interest in flexible and blended education. Generally, these videos are delivered as part of Learning Objects (LOs) through various Learning Management Systems (LMSs). Currently, creating these video learning objects (VLOs) is a cumbersome process, because it requires thorough analysis of the lecture content for meta-data extraction and of the structural information for indexing and retrieval purposes. Current e-learning systems and libraries (such as libSCORM) lack the functionality to exploit semantic content for automatic segmentation. Without this additional meta-data and structural information, lecture videos do not provide the level of interactivity required for flexible education. As a result, they fail to hold students' attention for long, and their effective use remains a challenge. Exploiting the visual actions present in non-scripted lecture videos can be useful for automatically segmenting these videos and extracting their structure. Such visual cues help identify possible key frames, index points, key events, and relevant meta-data useful for e-learning systems, video surrogates, and video skims. We therefore propose a multimodal action classification system for four predefined actions performed by the instructor in lecture videos: writing, erasing, speaking, and being idle. The proposed approach is based on human shape and motion analysis using motion history images (MHIs) at different temporal resolutions, allowing robust action classification. Additionally, it augments the visual feature classification with audio analysis, which is shown to improve the overall action classification performance. Initial experiments using recorded lecture videos gave an overall classification accuracy of 89.06%. We compared the performance of our approach to template matching using correlation and similitude and found a nearly 30% improvement over it. These encouraging results demonstrate the validity of the approach and its potential for extracting structural information from instructional videos.
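    The motion history image at the core of the abstract's approach can be sketched briefly. In an MHI, pixels where motion was just detected are set to a maximum value tau, while all other pixels decay toward zero, so a single image encodes where motion occurred and how recently; using several values of tau yields the "different temporal resolutions" the abstract mentions. The sketch below is a minimal illustrative implementation using numpy, not the authors' actual code; the threshold value and toy frames are assumptions for demonstration.

    ```python
    import numpy as np

    def update_mhi(mhi, frame_diff, tau, threshold=30):
        """Update a motion history image with one new frame difference.

        Pixels with recent motion (|diff| > threshold) are set to tau;
        all other pixels decay by 1 toward zero. Varying tau changes the
        temporal window the MHI summarizes.
        """
        motion_mask = np.abs(frame_diff) > threshold
        return np.where(motion_mask, float(tau), np.maximum(mhi - 1.0, 0.0))

    # Toy example: a bright 2-pixel blob moving one pixel per frame.
    frames = []
    for t in range(5):
        f = np.zeros((8, 8))
        f[3, t:t + 2] = 255
        frames.append(f)

    tau = 4
    mhi = np.zeros((8, 8))
    for prev, cur in zip(frames, frames[1:]):
        mhi = update_mhi(mhi, cur - prev, tau)

    # The most recent motion holds the value tau; older motion has decayed.
    ```

    In a full pipeline, such MHIs computed over short and long time windows would be fed to a classifier as shape/motion features, alongside the audio cues described above.
    
    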
    Original language: Undefined
    Title of host publication: Eighth International Conference on Signal Image Technology and Internet Based Systems (SITIS 2012)
    Place of publication: USA
    Publisher: IEEE Computer Society
    Number of pages: 7
    ISBN (Print): 978-1-4673-5152-2
    Publication status: Published - Nov 2012
    Event: 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012 - Sorrento, Italy
    Duration: 25 Nov 2012 - 29 Nov 2012

    Publication series
    Publisher: IEEE Computer Society


    Conference: 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012
    Other: 25-29 November 2012


    • EWI-23247
    • METIS-296451
    • IR-85505
