Stochastic modeling of soundtrack for efficient segmentation and indexing of video

Efficient and intelligent tools for managing digital content are essential for handling large collections of digital video. Multimedia analysis and understanding is an extremely challenging research area in this context, and the potential of audio analysis for video data management has yet to be fully exploited. We present a novel scheme for segmenting and indexing video by analyzing its audio track, and apply it to movies. We build models for several events of interest in the motion picture soundtrack, namely music, human speech, and silence. We propose hidden Markov models to capture the dynamics of the soundtrack and to detect these audio events, and we use the resulting models to segment and index the soundtrack. A practical difficulty is that motion picture audio is composite in nature: sounds from different sources are mixed together, with speech in the foreground over background music being a common example. Because multiple audio sources coexist, we model such composite events explicitly. Experiments show that explicit modeling of composite events gives better results than modeling the individual audio events separately.
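The segmentation idea can be illustrated with a minimal sketch. It is not the authors' system: it assumes per-frame log-likelihoods for each event class are already available (for example from separately trained class models), and it uses a single ergodic Markov chain over classes with Viterbi decoding to smooth the frame decisions into segments. The label set, the `stay_prob` parameter, and the function names are hypothetical; the composite class "speech+music" is included as its own state to mirror the explicit modeling of mixed audio described above.

```python
import math

# Hypothetical event labels: the three classes named in the abstract plus one
# explicit composite class for speech over background music.
LABELS = ["silence", "speech", "music", "speech+music"]


def viterbi(frame_loglik, stay_prob=0.9):
    """Return the most likely class index for every frame.

    frame_loglik[t][k] is the log-likelihood of frame t under class k.
    stay_prob is an assumed self-transition probability; the remaining
    probability mass is split evenly over the other classes.
    """
    if not frame_loglik:
        return []
    n = len(frame_loglik[0])
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (n - 1))

    def trans(j, k):
        # Log-probability of moving from class j to class k.
        return log_stay if j == k else log_move

    # Uniform prior over the class of the first frame.
    score = [math.log(1.0 / n) + ll for ll in frame_loglik[0]]
    backptrs = []
    for obs in frame_loglik[1:]:
        new_score, ptr = [], []
        for k in range(n):
            best_j = max(range(n), key=lambda j: score[j] + trans(j, k))
            new_score.append(score[best_j] + trans(best_j, k) + obs[k])
            ptr.append(best_j)
        score = new_score
        backptrs.append(ptr)

    # Trace the best path backwards from the best final class.
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def to_segments(path):
    """Collapse a per-frame path into (label, first_frame, last_frame) runs."""
    segments = []
    start = 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((LABELS[path[start]], start, t - 1))
            start = t
    return segments


# Toy input: four silent frames, then five frames of speech over music,
# one of which (frame 6) looks slightly more like plain speech.
toy = [[-1, -5, -5, -5]] * 4 + [[-5, -5, -5, -1]] * 5
toy[6] = [-5, -1, -5, -2]
print(to_segments(viterbi(toy)))
```

In this toy input the transition penalty outweighs the small likelihood gain of labeling frame 6 as plain speech, so the decoder keeps it inside the surrounding "speech+music" segment. A per-frame argmax would instead split that segment in three.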
