TVParser: An automatic TV video parsing method

In this paper, we propose an automatic approach to simultaneously name faces and discover scenes in TV shows. We follow the multi-modal idea of using the script to assist video content understanding, but without using timestamps (provided by script-subtitle alignment) as the connection. Instead, our approach investigates the temporal relation between faces in the video and names in the script, and infers a globally optimal video-script alignment from the character correspondence. The contribution of this paper is two-fold: (1) we propose a generative model, named TVParser, to describe the temporal character correspondence between video and script, from which the face-name relationship can be automatically learned as a model parameter while the video scene structure is inferred as a hidden state sequence; (2) we develop fast algorithms to accelerate both model parameter learning and state inference, resulting in an efficient and globally optimal alignment. We conduct extensive comparative experiments on popular TV series and report performance comparable to, and in some cases better than, existing methods.
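The idea of inferring scene structure as a hidden state sequence can be illustrated with a small dynamic-programming sketch. This is not the TVParser model itself: it assumes a simplified setting in which each video shot carries one face-cluster label, each script scene is summarized by a distribution over character names, and a face-name table `face_given_name` is already known rather than learned. The function name and all data structures are hypothetical and chosen only for illustration.

```python
import math

def align_shots_to_scenes(shot_faces, scene_names, face_given_name):
    """Monotonic Viterbi-style alignment of video shots to script scenes.

    shot_faces:      list of face-cluster labels, one per shot (assumed input)
    scene_names:     list of dicts {name: weight}, one per script scene
    face_given_name: dict {name: {face_label: probability}} (assumed known)
    Returns a list giving the scene index assigned to each shot.
    """
    eps = 1e-9  # avoids log(0) for face/name pairs with no support
    T, S = len(shot_faces), len(scene_names)
    if S == 0 or T < S:
        raise ValueError("need at least one shot per scene")

    def log_emit(s, face):
        # Probability of seeing this face in scene s, marginalized over names.
        p = sum(w * face_given_name.get(n, {}).get(face, 0.0)
                for n, w in scene_names[s].items())
        return math.log(p + eps)

    neg_inf = float("-inf")
    score = [[neg_inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_emit(0, shot_faces[0])  # the first shot opens scene 0

    for t in range(1, T):
        for s in range(min(t, S - 1) + 1):
            stay = score[t - 1][s]                       # remain in scene s
            advance = score[t - 1][s - 1] if s > 0 else neg_inf  # enter scene s
            if advance > stay:
                best, back[t][s] = advance, s - 1
            else:
                best, back[t][s] = stay, s
            if best > neg_inf:
                score[t][s] = best + log_emit(s, shot_faces[t])

    # Backtrack from the last shot, which must belong to the last scene.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


faces = ["A", "A", "B", "B", "B"]
scenes = [{"Ann": 1.0}, {"Bob": 1.0}]
table = {"Ann": {"A": 0.9, "B": 0.1}, "Bob": {"A": 0.1, "B": 0.9}}
print(align_shots_to_scenes(faces, scenes, table))
```

In this toy setting the scene order is fixed by the script and each shot either stays in the current scene or moves to the next one, so the best path is globally optimal for the simplified score. Learning the face-name table jointly with the alignment, as the abstract describes, is outside the scope of this sketch.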
