Learning a Contextual Multi-Thread Model for Movie/TV Scene Segmentation

Compared with general videos, movies and TV shows attract a significantly larger audience over time and contain rich narrative patterns of shots and scenes. In this paper, we aim to recover the inherent structure of scenes and shots in such video narratives. The recovered structure could be useful for subsequent video analysis tasks, such as tracking objects across cuts and action retrieval, as well as for enriching user browsing and video editing interfaces. Recent research on this problem has mainly focused on combining multiple cues such as scripts, subtitles, sound, or human faces. However, since visual information is sufficient for humans to identify scene boundaries and some of these cues are not always available, we are motivated to design a purely visual approach. Observing that dialog patterns occur frequently in a movie or TV show to form a scene, we propose a probabilistic framework that imitates the authoring process. A multi-thread shot model and contextual visual dynamics are embedded into a unified framework to capture the video hierarchy, and we devise an efficient algorithm to jointly learn the parameters of the unified model. We conduct experiments on two large datasets containing six movies and 24 episodes of Lost, a popular TV show with complex plot structures. Comparative results show that, leveraging only visual cues, our method successfully recovers complicated shot threads and outperforms several existing approaches. Moreover, our method is fast, which makes it well suited to large-scale computation.
