Hierarchical framework for plot de-interlacing of TV series based on speakers, dialogues and images

Since the 1990s, TV series have tended to introduce more and more main characters, and they are often composed of multiple intertwined stories. In this paper, we propose a hierarchical framework for plot de-interlacing that clusters semantic scenes into stories: a story is a group of scenes that are not necessarily contiguous but show a strong semantic relation. Each scene is described using three different modalities (based on color histograms, speaker diarization, and automatic speech recognition outputs) as well as their multimodal combination. We introduce the notion of character-driven episodes, in which stories are emphasized by the presence or absence of characters, and we propose an automatic method, based on a social graph, to detect these episodes. Depending on whether an episode is character-driven or not, plot de-interlacing, which is a scene clustering task, is performed either through traditional average-link agglomerative clustering with the speaker modality only, or through spectral clustering with the fusion of all modalities. Experiments conducted on twenty-three episodes from three quite different TV series (different lengths and formats) show that the hierarchical framework brings an improvement for all the series.
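As a rough illustration of the character-driven branch, the sketch below groups scenes into stories with average-link agglomerative clustering over a speaker-based distance. It is a minimal sketch under stated assumptions, not the implementation used in the paper: the Jaccard distance on speaker sets, the fixed target number of stories (`n_stories`), the function names, and the toy scenes are all hypothetical choices made for this example.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|.

    Assumed stand-in for a speaker-based scene distance; the abstract
    does not specify the actual measure.
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_link_clustering(scenes, n_stories):
    """Group scene indices into n_stories clusters.

    At each step, merge the two clusters whose average pairwise
    scene distance is smallest (average linkage). The stopping
    criterion (a fixed cluster count) is an assumption.
    """
    clusters = [[i] for i in range(len(scenes))]
    while len(clusters) > n_stories:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            dists = [
                jaccard_distance(scenes[i], scenes[j])
                for i in clusters[x]
                for j in clusters[y]
            ]
            avg = sum(dists) / len(dists)
            if best is None or avg < best[0]:
                best = (avg, x, y)
        _, x, y = best
        # y > x always holds, so deleting y leaves index x valid.
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters


# Toy episode: each scene is the set of speakers detected in it.
scenes = [
    {"A", "B"},  # scene 0
    {"C", "D"},  # scene 1
    {"A", "B"},  # scene 2
    {"C"},       # scene 3
    {"A"},       # scene 4
]
stories = average_link_clustering(scenes, n_stories=2)
print(stories)
```

On this toy input, scenes 0, 2 and 4 end up in one story and scenes 1 and 3 in the other, which shows how non-contiguous scenes sharing speakers can be grouped together. The spectral clustering branch for non-character-driven episodes is not sketched here.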
