Multimodal Data Fusion for Video Scene Segmentation

Automatic video segmentation into semantic units is important to organize an effective content based access to long video. The basic building blocks of professional video are shots. However the semantic meaning they provide is of a too low level. In this paper we focus on the problem of video segmentation into more meaningful high-level narrative units called scenes – aggregates of shots that are temporally continuous, share the same physical settings or represent continuous ongoing action. A statistical video scene segmentation framework is proposed which is capable to combine multiple mid-level features in a symmetrical and scalable manner. Two kinds of such features extracted in visual and audio domain are suggested. The results of experimental evaluations carried out on ground truth video are reported. They show that our algorithm effectively fuses multiple modalities with higher performance as compared with an alternative conventional fusion technique.

[1]  Chengcui Zhang,et al.  Scene change detection by audio and video clues , 2002, Proceedings. IEEE International Conference on Multimedia and Expo.

[2]  Yu Cao,et al.  Audio-Assisted Scene Segmentation for Story Browsing , 2003, CIVR.

[3]  Shih-Fu Chang,et al.  Determining computable scenes in films and their structures using audio-visual memory models , 2000, ACM Multimedia.

[4]  John R. Kender,et al.  Video scene segmentation via continuous video coherence , 1998, Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231).

[5]  Ivan Magrin-Chagnolleau,et al.  Second-order statistical measures for text-independent speaker identification , 1995, Speech Commun..

[6]  Dong Zhang,et al.  A New Shot Boundary Detection Algorithm , 2001, IEEE Pacific Rim Conference on Multimedia.

[7]  George Tzanetakis,et al.  Audio Analysis using the Discrete Wavelet Transform , 2001 .

[8]  Thomas M. Cover,et al.  Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing) , 2006 .

[9]  Liming Chen,et al.  A Query by Example Music Retrieval Algorithm , 2003 .

[10]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[11]  Wallapak Tavanapong,et al.  Shot clustering techniques for story browsing , 2004, IEEE Transactions on Multimedia.

[12]  Philippe Aigrain,et al.  Medium knowledge-based macro-segmentation of video into sequences , 1997 .

[13]  S. Chen,et al.  Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion , 1998 .

[14]  Mubarak Shah,et al.  A Graph Theoretic Approach for Scene Detection in Produced Videos , 2003 .

[15]  Boon-Lock Yeo,et al.  Time-constrained clustering for segmentation of video into story units , 1996, Proceedings of 13th International Conference on Pattern Recognition.