MovieBase: a movie database for event detection and behavioral analysis

The overwhelming amount of multimedia content shared on the web has created a need for the semantic identification and classification of these entities. Numerous research efforts have tackled this problem by developing advanced content analysis techniques and by leveraging readily available tags, scripts, and blogs related to the content. However, in many cases, especially for event detection and action recognition, progress has been hampered by the lack of large-scale, publicly available benchmarks. To address this problem, this paper presents a large-scale movie corpus named MovieBase that covers full-length feature movies as well as a large volume of movie-related video clips downloaded from YouTube. The corpus is designed for research in event detection and action recognition. It offers over 71 hours of video with a total of 69,129 shots, hand-labeled with 7 audio and 11 visual concept tags that semantically define 11 event categories within romantic and violent scenes. The corpus also comes with a set of pre-extracted low-level visual, motion, and audio features as well as high-level features, and related results are provided as a baseline for the movie event detection task.
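
To make the annotation scheme described above concrete, the sketch below shows one possible in-memory representation of a shot-level annotation: each shot carries audio concept tags, visual concept tags, and an event category. This is a minimal illustration only; the class, field, and tag names are hypothetical, since the abstract does not specify MovieBase's actual file formats or label vocabulary.

```python
# Hypothetical sketch of MovieBase-style shot annotations.
# Names and example tags are placeholders, not the corpus's actual schema.

from dataclasses import dataclass, field
from typing import List


@dataclass
class ShotAnnotation:
    """One of the 69,129 hand-labeled shots in the corpus."""
    video_id: str                                              # source movie or YouTube clip
    shot_index: int                                            # position of the shot within the video
    audio_concepts: List[str] = field(default_factory=list)    # drawn from the 7 audio concept tags
    visual_concepts: List[str] = field(default_factory=list)   # drawn from the 11 visual concept tags
    event_category: str = "none"                               # one of the 11 romantic/violence event categories


def shots_for_event(shots: List[ShotAnnotation], event: str) -> List[ShotAnnotation]:
    """Collect shots belonging to a given event category, e.g. to build a detector's training set."""
    return [s for s in shots if s.event_category == event]


if __name__ == "__main__":
    # Example with placeholder tag and event names.
    shot = ShotAnnotation(
        video_id="youtube_clip_0001",
        shot_index=12,
        audio_concepts=["gunshot"],
        visual_concepts=["fight"],
        event_category="gun_violence",
    )
    print(shots_for_event([shot], "gun_violence"))
```

In practice, such per-shot records would be populated from the corpus's concept and event label files and paired with the pre-extracted visual, motion, and audio features when training or evaluating event detectors.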
