Exploiting Visual Cues in Non-Scripted Lecture Videos for Multi-modal Action Recognition

The use of non-scripted lecture videos as part of learning material is becoming an everyday activity in most higher-education institutions due to the growing interest in flexible and blended education. Generally, these videos are delivered as part of Learning Objects (LOs) through various Learning Management Systems (LMSs). Currently, creating these video learning objects (VLOs) is a cumbersome process, because it requires thorough analysis of the lecture content for metadata extraction and the extraction of structural information for indexing and retrieval. Current e-learning systems and libraries (such as libSCORM) lack the functionality to exploit semantic content for automatic segmentation. Without this additional metadata and structural information, lecture videos do not provide the level of interactivity required for flexible education. As a result, they fail to hold students' attention for long, and their effective use remains a challenge. Exploiting the visual actions present in non-scripted lecture videos can help segment these videos automatically and extract their structure. Such visual cues help identify candidate key frames, index points, key events, and relevant metadata useful for e-learning systems, video surrogates, and video skims. We therefore propose a multi-modal action classification system for four predefined actions performed by an instructor in lecture videos: writing, erasing, speaking, and being idle. The proposed approach is based on human shape and motion analysis using motion history images (MHIs) at different temporal resolutions, allowing robust action classification. Additionally, it augments the visual feature classification with audio analysis, which is shown to improve the overall action classification performance. Initial experimental results on recorded lecture videos gave an overall classification accuracy of 89.06%. We compared our approach against template matching using correlation and similitude and found an improvement of nearly 30%. These encouraging results demonstrate the validity of the approach and its potential for extracting structural information from instructional videos.
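To make the core idea concrete, the following is a minimal sketch of the multi-resolution MHI computation the abstract describes. It is not the authors' implementation: the frame-differencing motion mask, the threshold value, and the temporal windows in `taus` are all illustrative assumptions, and the paper's actual parameters and descriptors may differ.

```python
import numpy as np

def motion_mask(prev_gray: np.ndarray, cur_gray: np.ndarray,
                thresh: int = 25) -> np.ndarray:
    """Binary motion mask via simple frame differencing (threshold is illustrative)."""
    diff = np.abs(cur_gray.astype(np.int16) - prev_gray.astype(np.int16))
    return diff > thresh

def update_mhi(mhi: np.ndarray, mask: np.ndarray, tau: float) -> np.ndarray:
    """One Bobick-Davis MHI update: stamp moving pixels at tau, decay the rest by one."""
    return np.where(mask, float(tau), np.maximum(mhi - 1.0, 0.0))

# One MHI per temporal resolution; these tau values are hypothetical.
TAUS = (10, 30, 90)

def multi_resolution_mhis(frames):
    """Yield, per frame, a dict of MHIs normalised to [0, 1], one per tau.

    `frames` is any iterable of 2-D grayscale arrays of equal shape.
    """
    frames = iter(frames)
    prev = next(frames)
    mhis = {tau: np.zeros(prev.shape, np.float32) for tau in TAUS}
    for cur in frames:
        mask = motion_mask(prev, cur)
        for tau in TAUS:
            mhis[tau] = update_mhi(mhis[tau], mask, tau)
        prev = cur
        # Normalising by tau puts all resolutions on a comparable scale
        # before shape/motion features are extracted for the classifier.
        yield {tau: mhis[tau] / tau for tau in TAUS}
```

The intuition behind keeping several temporal resolutions is that the four target actions unfold at different time scales, so a classifier fed features pooled from short- and long-memory templates can separate brief motions from sustained ones; the audio channel is then fused on top of these visual features.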
