Recommendations for recognizing video events by concept vocabularies

Representing videos using vocabularies composed of concept detectors appears promising for generic event recognition. While many have recently shown the benefits of concept vocabularies for recognition, the characteristics of a universal concept vocabulary suited for representing events have been largely ignored. In this paper, we study how to create an effective vocabulary for arbitrary-event recognition in web video. We consider five research questions related to the number, the type, the specificity, the quality and the normalization of the detectors in concept vocabularies. A rigorous experimental protocol, using a pool of 1346 concept detectors trained on publicly available annotations, two large datasets of arbitrary web video and a common event recognition pipeline, allows us to analyze the performance of various concept vocabulary definitions. From the analysis we arrive at the recommendation that, for effective event recognition, the concept vocabulary should (i) contain more than 200 concepts, (ii) be diverse by covering object, action, scene, people, animal and attribute concepts, (iii) include both general and specific concepts, (iv) be improved by increasing the number of concepts rather than the quality of the individual detectors, and (v) contain detectors that are appropriately normalized. We consider these recommendations the most important contribution of the paper, as they provide guidelines for future work.
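To make the notion of a concept-vocabulary representation concrete, the following is a minimal sketch of how a video might be encoded as a vector of normalized detector scores. It is not the pipeline evaluated in the paper: the sigmoid normalization with fixed parameters `a` and `b`, the average pooling over frames, and the names `sigmoid` and `video_representation` are all assumptions made for illustration.

```python
import math


def sigmoid(score, a=-1.0, b=0.0):
    """Map a raw detector score to (0, 1).

    Assumption: a Platt-style sigmoid with hand-picked a and b;
    in practice these would be fitted per detector.
    """
    return 1.0 / (1.0 + math.exp(a * score + b))


def video_representation(frame_scores, normalize=sigmoid):
    """Average the normalized concept scores over all frames.

    frame_scores: list of frames, each a list with one raw score
    per concept in the vocabulary (hypothetical input layout).
    Returns one value per concept for the whole video.
    """
    n_frames = len(frame_scores)
    n_concepts = len(frame_scores[0])
    return [
        sum(normalize(frame[c]) for frame in frame_scores) / n_frames
        for c in range(n_concepts)
    ]


# Hypothetical raw scores: 3 frames, vocabulary of 4 concepts.
raw = [
    [2.0, -1.0, 0.0, 0.5],
    [1.5, -2.0, 0.0, 0.1],
    [2.5, -0.5, 0.0, -0.3],
]
rep = video_representation(raw)
```

Under these assumptions every entry of `rep` lies in (0, 1), so detectors with very different raw score ranges contribute on a comparable scale before an event classifier is trained on the vectors.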
