A hidden Markov model framework for video segmentation using audio and image features

This paper describes a technique for segmenting video using hidden Markov models (HMM). Video is segmented into regions defined by shots, shot boundaries, and camera movement within shots. Features for segmentation include an image-based distance between adjacent video frames, an audio distance based on the acoustic difference in intervals just before and after the frames, and an estimate of motion between the two frames. Typical video segmentation algorithms classify shot boundaries by computing an image-based distance between adjacent frames and comparing this distance to fixed, manually determined thresholds. Motion and audio information is used separately. In contrast, our segmentation technique allows features to be combined within the HMM framework. Further, thresholds are not required since automatically trained HMMs take their place. This algorithm has been tested on a video data base, and has been shown to improve the accuracy of video segmentation over standard threshold-based systems.

[1]  Ramesh C. Jain,et al.  Dynamic vision , 1988, [1988 Proceedings] 9th International Conference on Pattern Recognition.

[2]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[3]  Herbert Gish,et al.  Segregation of speakers for speech recognition and speaker identification , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[4]  Lynn Wilcox,et al.  Audio indexing using speaker identification , 1994, Optics & Photonics.

[5]  Behzad Shahraray,et al.  Scene change detection and content-based sampling of video sequences , 1995, Electronic Imaging.

[6]  Ramin Zabih,et al.  A feature-based algorithm for detecting and classifying scene breaks , 1995, MULTIMEDIA '95.

[7]  Michael S. Phillips,et al.  Video segmentation techniques for news , 1996, Other Conferences.

[8]  John S. Boreczky,et al.  Comparison of video shot boundary detection techniques , 1996, J. Electronic Imaging.

[9]  Wayne H. Wolf,et al.  Hidden Markov model parsing of video programs , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[10]  Jeho Nam,et al.  Speaker identification and video analysis for hierarchical video shot classification , 1997, Proceedings of International Conference on Image Processing.

[11]  Riccardo Leonardi,et al.  Audio as a support to scene change detection and characterization of video sequences , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[12]  John S. Boreczky,et al.  Annotation and segmentation for multimedia indexing and retrieval , 1998, Proceedings of the Thirty-First Hawaii International Conference on System Sciences.