Shuffled ImageNet Banks for Video Event Detection and Search

This article aims for the detection and search of events in videos, where video examples are either scarce or even absent during training. To enable such event detection and search, ImageNet concept banks have shown to be effective. Rather than employing the standard concept bank of 1,000 ImageNet classes, we leverage the full 21,841-class dataset. We identify two problems with using the full dataset: (i) there is an imbalance between the number of examples per concept, and (ii) not all concepts are equally relevant for events. In this article, we propose to balance large-scale image hierarchies for pre-training. We shuffle concepts based on bottom-up and top-down operations to overcome the problems of example imbalance and concept relevance. Using this strategy, we arrive at the shuffled ImageNet bank, a concept bank with an order of magnitude more concepts compared to standard ImageNet banks. Compared to standard ImageNet pre-training, our shuffles result in more discriminative representations to train event models from the limited video event examples. For event search, the broad range of concepts enable a closer match between textual queries of events and concept detections in videos. Experimentally, we show the benefit of the proposed bank for event detection and event search, with state-of-the-art performance for both tasks on the challenging TRECVID Multimedia Event Detection and Ad-Hoc Video Search benchmarks.

[1]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[4]  Ramakant Nevatia,et al.  Large-scale web video event classification by use of Fisher Vectors , 2013, 2013 IEEE Workshop on Applications of Computer Vision (WACV).

[5]  Xirong Li,et al.  Early Embedding and Late Reranking for Video Captioning , 2016, ACM Multimedia.

[6]  Cees Snoek,et al.  Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[7]  Mark D. McDonnell,et al.  Understanding Data Augmentation for Classification: When to Warp? , 2016, 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA).

[8]  Xirong Li,et al.  Detecting Violence in Video using Subclasses , 2016, ACM Multimedia.

[9]  Nicu Sebe,et al.  The Many Shades of Negativity , 2017, IEEE Transactions on Multimedia.

[10]  Hui Cheng,et al.  Video event recognition using concept attributes , 2013, 2013 IEEE Workshop on Applications of Computer Vision (WACV).

[11]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[12]  Jarke J. van Wijk,et al.  Comparing Personal Image Collections with PICTuReVis , 2017, Comput. Graph. Forum.

[13]  Yi Yang,et al.  Bidirectional Multirate Reconstruction for Temporal Modeling in Videos , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Cees Snoek,et al.  Bag-of-Fragments: Selecting and Encoding Video Fragments for Event Detection and Recounting , 2015, ICMR.

[15]  Rong Yan,et al.  Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study With Broadcast News , 2007, IEEE Transactions on Multimedia.

[16]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Yi Yang,et al.  Complex Event Detection by Identifying Reliable Shots from Untrimmed Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[18]  Shih-Fu Chang,et al.  Minimally Needed Evidence for Complex Event Recognition in Unconstrained Videos , 2014, ICMR.

[19]  Bhiksha Raj,et al.  Beyond Gaussian Pyramid: Multi-skip Feature Stacking for action recognition , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Yi Yang,et al.  Content-Based Video Search over 1 Million Videos with 1 Core in 1 Second , 2015, ICMR.

[22]  Xirong Li,et al.  TagBook: A Semantic Video Representation Without Supervision for Event Detection , 2015, IEEE Transactions on Multimedia.

[23]  Cees Snoek,et al.  Composite Concept Discovery for Zero-Shot Video Event Detection , 2014, ICMR.

[24]  Gang Hua,et al.  Semantic Model Vectors for Complex Video Event Recognition , 2012, IEEE Transactions on Multimedia.

[25]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[26]  Shih-Fu Chang,et al.  CU-VIREO 374 : Fusing Columbia 374 and VIREO 374 for Large Scale Semantic Concept Detection , 2008 .

[27]  Chong-Wah Ngo,et al.  Selection of Concept Detectors for Video Search by Ontology-Enriched Semantic Spaces , 2008, IEEE Transactions on Multimedia.

[28]  Koichi Shinoda,et al.  Adaptation of Word Vectors using Tree Structure for Visual Semantics , 2016, ACM Multimedia.

[29]  Nitish Srivastava,et al.  Exploiting Image-trained CNN Architectures for Unconstrained Video Classification , 2015, BMVC.

[30]  Yongdong Zhang,et al.  Enhancing Video Event Recognition Using Automatically Constructed Semantic-Visual Knowledge Base , 2015, IEEE Transactions on Multimedia.

[31]  Yi Yang,et al.  Uncovering the Temporal Context for Video Question Answering , 2017, International Journal of Computer Vision.

[32]  Yi Yang,et al.  Bi-Level Semantic Representation Analysis for Multimedia Event Detection , 2017, IEEE Transactions on Cybernetics.

[33]  Yi Yang,et al.  Semantic Pooling for Complex Event Analysis in Untrimmed Videos , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[35]  Peng Wang,et al.  Web Video Event Recognition by Semantic Analysis From Ubiquitous Documents , 2016, IEEE Transactions on Image Processing.

[36]  Mubarak Shah,et al.  Complex Events Detection Using Data-Driven Concepts , 2012, ECCV.

[37]  John R. Smith,et al.  Large-scale concept ontology for multimedia , 2006, IEEE MultiMedia.

[38]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[39]  Zi Huang,et al.  Robust spatial-temporal deep model for multimedia event detection , 2016, Neurocomputing.

[40]  Yi Yang,et al.  Fast and Accurate Content-based Semantic Search in 100M Internet Videos , 2015, ACM Multimedia.

[41]  Yejin Choi,et al.  From Large Scale Image Categorization to Entry-Level Categories , 2013, 2013 IEEE International Conference on Computer Vision.

[42]  Jin Zhao,et al.  Video Retrieval Using High Level Features: Exploiting Query Matching and Confidence-Based Weighting , 2006, CIVR.

[43]  Koen E. A. van de Sande,et al.  All vehicles are cars: subclass preferences in container concepts , 2012, ICMR '12.

[44]  Céline Hudelot,et al.  Learning Finer-class Networks for Universal Representations , 2018, BMVC.

[45]  Dong Wang,et al.  The importance of query-concept-mapping for automatic video retrieval , 2007, ACM Multimedia.

[46]  Yi Yang,et al.  Complex Event Detection using Semantic Saliency and Nearly-Isotonic SVM , 2015, ICML.

[47]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[48]  Georges Quénot,et al.  TRECVID 2017: Evaluating Ad-hoc and Instance Video Search, Events Detection, Video Captioning and Hyperlinking , 2017, TRECVID.

[49]  Masoud Mazloom,et al.  Conceptlets: Selective Semantics for Classifying Video Events , 2014, IEEE Transactions on Multimedia.

[50]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[51]  Dennis Koelma,et al.  The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection , 2016, ICMR.

[52]  Yi Yang,et al.  Compound Memory Networks for Few-Shot Video Classification , 2018, ECCV.

[53]  Cees Snoek,et al.  UvA-DARE ( Digital Academic Repository ) Event Fisher Vectors : Robust Encoding Visual Diversity of Visual Streams , 2015 .

[54]  Chengqi Zhang,et al.  Dynamic Concept Composition for Zero-Example Event Detection , 2016, AAAI.

[55]  Zi Huang,et al.  Leveraging Weak Semantic Relevance for Complex Video Event Classification , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[56]  Nicu Sebe,et al.  Joint Attributes and Event Analysis for Multimedia Event Detection , 2018, IEEE Transactions on Neural Networks and Learning Systems.

[57]  Hanzhang Wang,et al.  Categorizing Concepts with Basic Level for Vision-to-Language , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[58]  Cees Snoek,et al.  Video2vec Embeddings Recognize Events When Examples Are Scarce , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[59]  Ioannis Patras,et al.  Query and Keyframe Representations for Ad-hoc Video Search , 2017, ICMR.

[60]  Wayne D. Gray,et al.  Basic objects in natural categories , 1976, Cognitive Psychology.

[61]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[62]  Jonathan Foote,et al.  Content-based retrieval of music and audio , 1997, Other Conferences.

[63]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  John R. Smith,et al.  Modal Keywords, Ontologies, and Reasoning for Video Understanding , 2003, CIVR.

[65]  Marcel Worring,et al.  The challenge problem for automated detection of 101 semantic concepts in multimedia , 2006, MM '06.

[66]  Cees Snoek,et al.  Recommendations for recognizing video events by concept vocabularies , 2014, Comput. Vis. Image Underst..

[67]  Yi Yang,et al.  DevNet: A Deep Event Network for multimedia event detection and evidence recounting , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Larry S. Davis,et al.  Selecting Relevant Web Trained Concepts for Automated Event Retrieval , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[69]  Yi Yang,et al.  They are Not Equally Reliable: Semantic Event Search Using Differentiated Concept Classifiers , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[70]  Zheng Zhang,et al.  MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems , 2015, ArXiv.

[71]  Yi Yang,et al.  You Lead, We Exceed: Labor-Free Video Concept Learning by Jointly Exploiting Web Videos and Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[72]  Yu-Gang Jiang,et al.  VIREO-374 : LSCOM Semantic Concept Detectors Using Local Keypoint Features , 2007 .

[73]  Marcel Worring,et al.  Adding Semantics to Detectors for Video Retrieval , 2007, IEEE Transactions on Multimedia.

[74]  Hui Cheng,et al.  Evaluation of low-level features and their combinations for complex event detection in open source videos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[75]  Yi Yang,et al.  A discriminative CNN video representation for event detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[76]  Shih-Fu Chang,et al.  Columbia University’s Baseline Detectors for 374 LSCOM Semantic Visual Concepts , 2007 .

[77]  Cees Snoek,et al.  VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events , 2014, ACM Multimedia.

[78]  Cees Snoek,et al.  Objects2action: Classifying and Localizing Actions without Any Video Example , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[79]  Dong Liu,et al.  EventNet: A Large Scale Structured Concept Library for Complex Event Detection in Video , 2015, ACM Multimedia.

[80]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.