A hidden Markov model approach to the structure of documentaries

We have hand-segmented two very long documentaries (100 minutes total) into their component shots. As with other extended videos, the distribution of shot lengths again appears to be log-normal. Shot lengths are similar to those in dramas, comedies, or action films, but much shorter than those in home videos. Fades appear to be an important device for signaling transitions between semantic units. We have sought evidence for shot composition rules by means of hidden Markov models (HMMs). We find that camera motion (tilt, pan, zoom) is not significantly governed by rules. However, the bulk of the documentaries takes the form of an alternation between commentators and several types of primary supporting material; additionally, the documentaries end with a visual summary. We find that the best approach is to train the HMM on labeled subsequences of approximately equal elapsed time, rather than on subsequences with an equal number of shots or on subsequences whose shots are aligned to some semantic event. This may reflect fundamental temporal limits on human visual attention. We propose that such an underlying structure can suggest more human-sensitive designs for the analysis and graphic display of the contents of extended videos, for summarization, browsing, and indexing.
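As an illustration of the distinction drawn above between equal-elapsed-time and equal-shot-count training subsequences, the following is a minimal sketch, not the authors' procedure. The shot labels, the durations, the greedy grouping rule, and the names `split_by_elapsed_time`, `split_by_shot_count`, and `target_seconds` are all assumptions made for this example.

```python
from typing import List, Tuple

# A shot is a (label, duration_in_seconds) pair. The labels are hypothetical.
Shot = Tuple[str, float]


def split_by_elapsed_time(shots: List[Shot], target_seconds: float) -> List[List[Shot]]:
    """Greedily group consecutive shots until each group lasts at least target_seconds.

    The final group may be shorter if the remaining shots run out.
    """
    chunks: List[List[Shot]] = []
    current: List[Shot] = []
    elapsed = 0.0
    for shot in shots:
        current.append(shot)
        elapsed += shot[1]
        if elapsed >= target_seconds:
            chunks.append(current)
            current, elapsed = [], 0.0
    if current:
        chunks.append(current)
    return chunks


def split_by_shot_count(shots: List[Shot], shots_per_chunk: int) -> List[List[Shot]]:
    """Group consecutive shots into chunks holding a fixed number of shots."""
    return [shots[i:i + shots_per_chunk] for i in range(0, len(shots), shots_per_chunk)]


if __name__ == "__main__":
    # Made-up shot sequence, for illustration only.
    shots = [
        ("commentator", 12.0), ("archival", 3.0), ("archival", 4.0),
        ("commentator", 9.0), ("graphic", 2.0), ("archival", 6.0),
        ("commentator", 15.0), ("summary", 5.0),
    ]
    for name, chunks in [
        ("equal time", split_by_elapsed_time(shots, target_seconds=15.0)),
        ("equal count", split_by_shot_count(shots, shots_per_chunk=2)),
    ]:
        durations = [sum(d for _, d in chunk) for chunk in chunks]
        print(name, [len(c) for c in chunks], durations)
```

Under the equal-time split, chunks hold varying numbers of shots but similar durations; under the equal-count split the reverse holds. The label sequences of either set of chunks could then serve as training observations for an HMM.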
