Aligning plot synopses to videos for story-based retrieval

We propose a method to facilitate search through the storyline of TV series episodes. To this end, we use human-written, crowdsourced descriptions of the story conveyed in the video, known as plot synopses. We obtain such synopses from websites such as Wikipedia and propose several methods to align each sentence of the plot to shots in the video. The semantic, story-based video retrieval problem is thus transformed into a much simpler text-based search: a query is matched against the synopsis sentences, and the set of shots aligned to the matching sentences is returned as the corresponding video snippet. The alignment is performed by first computing a similarity score between every shot and every sentence using cues such as character identities and keyword matches between plot synopses and subtitles. We then formulate the alignment as an optimization problem and solve it efficiently using dynamic programming. We evaluate our methods on the fifth season of the TV series Buffy the Vampire Slayer and show encouraging results for both the alignment and the retrieval of story events.
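The two-step alignment described above (score every shot against every sentence, then find the best alignment by dynamic programming) can be illustrated with a minimal sketch. This is not the paper's exact formulation. It assumes a similarity based only on shared words, and it assumes an alignment in which every shot is assigned to exactly one sentence, sentences appear in story order, and no sentence is skipped. The function names `keyword_similarity` and `align` and the toy score matrix are hypothetical.

```python
from typing import List


def keyword_similarity(sentences: List[str], subtitles: List[str]) -> List[List[float]]:
    """Toy similarity (an assumption): number of lowercase words shared by a
    synopsis sentence and the subtitle text of a shot."""
    sim = []
    for sentence in sentences:
        words = set(sentence.lower().split())
        sim.append([float(len(words & set(sub.lower().split()))) for sub in subtitles])
    return sim


def align(sim: List[List[float]]) -> List[int]:
    """Return, for each shot, the index of the sentence it is aligned to.

    Maximizes the summed similarity under the assumed constraints: sentence
    indices are non-decreasing over shots, start at the first sentence, end at
    the last one, and increase by at most one per shot.
    """
    n, m = len(sim), len(sim[0])
    if n > m:
        raise ValueError("this sketch needs at least as many shots as sentences")
    neg_inf = float("-inf")
    # best[i][j]: best total score with shot j assigned to sentence i.
    best = [[neg_inf] * m for _ in range(n)]
    # advanced[i][j]: True if shot j-1 belonged to sentence i-1 on the best path.
    advanced = [[False] * m for _ in range(n)]
    best[0][0] = sim[0][0]
    for j in range(1, m):
        for i in range(min(j + 1, n)):
            stay = best[i][j - 1]
            move = best[i - 1][j - 1] if i > 0 else neg_inf
            if move > stay:
                best[i][j] = move + sim[i][j]
                advanced[i][j] = True
            else:
                best[i][j] = stay + sim[i][j]
    # Backtrack from the last sentence at the last shot.
    assignment = [0] * m
    i = n - 1
    for j in range(m - 1, -1, -1):
        assignment[j] = i
        if advanced[i][j]:
            i -= 1
    return assignment


# Made-up scores for 2 sentences and 4 shots, for illustration only.
toy_scores = [[3.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.0]]
print(align(toy_scores))  # [0, 0, 1, 1]
```

With such an alignment in hand, retrieval reduces to finding the synopsis sentences that match a text query and returning the shots assigned to them. The table has one cell per sentence-shot pair, so the cost grows with the product of the two counts.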
