Integrating Audio Visual Data for Human Action Detection

This paper presents a method that integrates audio and visual information for action-scene analysis in movies. The approach is top-down: it determines and extracts action scenes from video by analyzing both the audio and the video data. We directly model the hierarchy and shared structure of human behaviours, and we present a hidden-Markov-model-based framework for activity recognition. The proposed framework recognizes actions by measuring human-action-based information from video, and it has the following characteristics: it handles both visual and auditory information; it captures both spatial and temporal characteristics; and the extracted features are natural, in the sense that they are closely related to human perceptual processing. We implement action identification by extracting syntactic properties of a video such as edge features, colour distribution, audio, and motion vectors. Concretely, we present a two-layer hierarchical module for action recognition. The first layer performs supervised learning to recognize the individual actions of participants from low-level visual features. The second layer models actions, taking the output of the first layer as its observations and fusing it with high-level audio features. Both layers use hidden-Markov-model-based approaches, for action recognition and clustering respectively. The proposed technique characterizes scenes by integrating cues obtained from both the video and the audio tracks. Using joint audio-visual information can significantly improve action-detection accuracy over using either modality alone, because multimodal features can resolve ambiguities that are present in a single modality; accordingly, we model the features in multidimensional form.
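The two-layer decoding described above can be sketched with a discrete-emission hidden Markov model and Viterbi decoding: a first HMM maps quantized visual features to per-frame action labels, and a second HMM takes those labels fused with audio symbols as its observations and decodes scene labels. This is a minimal illustrative sketch in NumPy, not the paper's trained system — the state spaces, observation vocabularies, and all parameter values below are hypothetical placeholders.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most-likely state path for a discrete-emission HMM (log domain)."""
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    T, n = len(obs), len(pi)
    delta = np.empty((T, n))           # best log-score of a path ending in each state
    psi = np.zeros((T, n), dtype=int)  # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: transition i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

# Layer 1: quantized visual symbols -> individual action labels (0=walk, 1=fight).
# All probabilities are made-up toy values.
pi1 = [0.6, 0.4]
A1  = [[0.9, 0.1], [0.2, 0.8]]
B1  = [[0.7, 0.2, 0.1],   # emission probabilities over 3 visual symbols
       [0.1, 0.2, 0.7]]
visual = [0, 0, 2, 2, 2, 0]
actions = viterbi(pi1, A1, B1, visual)

# Layer 2: fuse layer-1 action labels with audio symbols (0=quiet, 1=loud)
# into one joint observation index, then decode scene labels (0=calm, 1=action).
audio = [0, 0, 1, 1, 1, 0]
joint = [a * 2 + s for a, s in zip(actions, audio)]
pi2 = [0.5, 0.5]
A2  = [[0.85, 0.15], [0.15, 0.85]]
B2  = [[0.60, 0.20, 0.15, 0.05],   # a calm scene favours walk+quiet
       [0.05, 0.15, 0.20, 0.60]]   # an action scene favours fight+loud
scenes = viterbi(pi2, A2, B2, joint)
print(actions.tolist(), scenes.tolist())  # -> [0, 0, 1, 1, 1, 0] [0, 0, 1, 1, 1, 0]
```

Fusing the modalities at the joint-observation level is one simple choice; it lets the loud audio segment reinforce the fight-like visual evidence so the scene decoder resolves frames that would be ambiguous from either modality alone.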
