Beat-Event Detection in Action Movie Franchises

While important advances were recently made towards temporally localizing and recognizing specific human actions or activities in videos, efficient detection and classification of long video chunks belonging to semantically defined categories such as “pursuit” or “romance” remains challenging. We introduce a new dataset, Action Movie Franchises, consisting of a collection of Hollywood action movie franchises. We define 11 non-exclusive semantic categories — called beat-categories — that are broad enough to cover most of the movie footage. The corresponding beat-events are annotated as groups of video shots, possibly overlapping. We propose an approach for localizing beat-events based on classifying shots into beat-categories and learning the temporal constraints between shots. We show that temporal constraints significantly improve the classification performance. We set up an evaluation protocol for beat-event localization as well as for shot classification, depending on whether movies from the same franchise are present or not in the training data.

[1]  Leonid Sigal,et al.  Poselet Key-Framing: A Model for Human Activity Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Bertrand Chupeau,et al.  A Video Fingerprint Based on Visual Digest and Local Fingerprints , 2006, 2006 International Conference on Image Processing.

[3]  Cordelia Schmid,et al.  The AXES submissions at TRECVID 2013 , 2013, TRECVID.

[4]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[5]  Rémi Ronfard,et al.  A framework for aligning and indexing movies with their script , 2003, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698).

[6]  Thomas S. Huang,et al.  Exploring video structure beyond the shots , 1998, Proceedings. IEEE International Conference on Multimedia Computing and Systems (Cat. No.98TB100241).

[7]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[8]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[9]  Blake Snyder Save the Cat: The Last Book on Screenwriting You'll Ever Need , 2005 .

[10]  Andrew Zisserman,et al.  Hello! My name is... Buffy'' -- Automatic Naming of Characters in TV Video , 2006, BMVC.

[11]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[12]  Ying Li,et al.  Content-based movie analysis and indexing based on audiovisual cues , 2004, IEEE Transactions on Circuits and Systems for Video Technology.

[13]  Cordelia Schmid,et al.  Finding Actors and Actions in Movies , 2013, 2013 IEEE International Conference on Computer Vision.

[14]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[15]  Mohammad Soleymani,et al.  VSD, a public dataset for the detection of violent scenes in movies: design, annotation, analysis and evaluation , 2014, Multimedia Tools and Applications.

[16]  Georges Quénot,et al.  TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[17]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[18]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[19]  Patrick Pérez,et al.  Retrieving actions in movies , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[20]  Julee Cobb,et al.  Hello, My Name Is… , 2016 .

[21]  Ben Taskar,et al.  Movie/Script: Alignment and Parsing of Video and Text Transcription , 2008, ECCV.

[22]  Cordelia Schmid,et al.  The LEAR submission at Thumos 2014 , 2014 .

[23]  Fernando De la Torre,et al.  Max-Margin Early Event Detectors , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[25]  Ronald W. Schafer,et al.  Introduction to Digital Speech Processing , 2007, Found. Trends Signal Process..

[26]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[27]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[28]  Fernando De la Torre,et al.  Joint segmentation and classification of human actions in video , 2011, CVPR 2011.

[29]  C. Schmid,et al.  Category-Specific Video Summarization , 2014, ECCV.

[30]  Andrew Zisserman,et al.  Fisher Vector Faces in the Wild , 2013, BMVC.

[31]  Cordelia Schmid,et al.  Actions in context , 2009, CVPR.