Video Scene Segmentation of TV Series Using Multimodal Neural Features

Segmenting a video, a book, or a TV series into scenes organizes it into Logical Story Units and is an essential step in representing, extracting, and understanding its narrative structure. We propose an automatic scene segmentation method for TV series that groups adjacent shots using a combination of multimodal neural features: visual and textual features, further augmented with temporal information, which may improve the clustering of adjacent shots. The reported experiments compare early and late fusion of the features, subsampling of video frames, and various shot clustering algorithms. The proposed method achieved good recall, precision, and F-measure when tested on several seasons of two popular TV series.
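The difference between early and late fusion mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `alpha` and `time_weight` parameters, the use of cosine similarity, and the threshold-based boundary rule are all assumptions made for illustration. Early fusion concatenates the per-shot visual and textual vectors (plus a normalized shot position as the temporal feature) before any grouping; late fusion computes one similarity matrix per modality and combines them afterwards.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so modalities are comparable."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def early_fusion(visual, textual, time_weight=1.0):
    """Concatenate per-shot features and append a temporal column.

    The temporal column is the shot position scaled to [0, 1]; the
    weighting scheme is a hypothetical choice, not taken from the paper.
    """
    n_shots = visual.shape[0]
    position = np.linspace(0.0, 1.0, n_shots).reshape(-1, 1) * time_weight
    return np.hstack([l2_normalize(visual), l2_normalize(textual), position])


def late_fusion_similarity(visual, textual, alpha=0.5):
    """Combine per-modality cosine similarity matrices with weight alpha."""
    v = l2_normalize(visual)
    t = l2_normalize(textual)
    return alpha * (v @ v.T) + (1.0 - alpha) * (t @ t.T)


def boundaries_from_adjacent_similarity(similarity, threshold):
    """Place a scene boundary before shot i+1 when shots i and i+1 differ.

    This simple threshold rule stands in for the shot clustering
    algorithms compared in the paper.
    """
    adjacent = np.diagonal(similarity, offset=1)
    return [i + 1 for i, s in enumerate(adjacent) if s < threshold]
```

With one row per shot in `visual` and `textual`, `boundaries_from_adjacent_similarity(late_fusion_similarity(visual, textual), threshold)` returns the indices of the shots that open a new scene. The matrix returned by `early_fusion` could instead be passed to any clustering routine that respects shot order.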
