Dynamic Concept Composition for Zero-Example Event Detection

In this paper, we focus on automatically detecting events in unconstrained videos without any visual training exemplars. In principle, zero-shot learning makes it possible to train an event detection model under the assumption that an event (e.g., a birthday party) can be described by multiple mid-level semantic concepts (e.g., "blowing candle", "birthday cake"). Towards this goal, we first pre-train a bundle of concept classifiers using data from other sources. We then evaluate the semantic correlation of each concept with the event of interest and select the relevant concept classifiers, which are applied to all test videos to obtain multiple prediction score vectors. While most existing systems combine the predictions of the concept classifiers with fixed weights, we propose to learn the optimal weights of the concept classifiers for each test video by exploring a set of videos available online with free-form text descriptions of their content. To validate the effectiveness of the proposed approach, we conduct extensive experiments on the TRECVID MEDTest 2014, MEDTest 2013, and CCV datasets. The experimental results confirm the superiority of the proposed approach.
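To make the pipeline concrete, below is a minimal sketch of the fixed-weight baseline that the abstract contrasts against: concepts are ranked by semantic similarity to the event query (e.g., via word-embedding cosine similarity), the top-ranked classifiers are kept, and their per-video scores are fused with similarity-derived weights. All embeddings, concept names, and scores here are made-up toy values for illustration; the paper's actual contribution, learning per-video weights from online videos with text descriptions, is not shown.

```python
import numpy as np

# Hypothetical toy setup: pre-computed semantic embeddings for the event
# query and for each mid-level concept (in practice these could come from
# skip-gram / word2vec vectors; the numbers below are invented).
event_vec = np.array([0.9, 0.1, 0.0])
concept_vecs = {
    "blowing_candle": np.array([0.8, 0.2, 0.1]),
    "birthday_cake":  np.array([0.7, 0.3, 0.0]),
    "dog_grooming":   np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def concept_weights(event_vec, concept_vecs, top_k=2):
    """Rank concepts by semantic similarity to the event, keep the top-k
    most relevant ones, and normalise their similarities to sum to 1."""
    sims = {c: cosine(event_vec, v) for c, v in concept_vecs.items()}
    kept = dict(sorted(sims.items(), key=lambda kv: -kv[1])[:top_k])
    z = sum(kept.values())
    return {c: s / z for c, s in kept.items()}

def event_score(concept_scores, weights):
    """Fuse one video's concept classifier scores with fixed weights."""
    return sum(weights[c] * concept_scores[c] for c in weights)

weights = concept_weights(event_vec, concept_vecs)
# Prediction scores of the selected concept classifiers on one test video
# (again, invented values).
video_scores = {"blowing_candle": 0.7, "birthday_cake": 0.9, "dog_grooming": 0.2}
score = event_score(video_scores, weights)
```

Because "dog grooming" has low similarity to the birthday-party query, its classifier is discarded before fusion; the fused score is a convex combination of the remaining concept scores. The proposed method would instead adapt `weights` per test video rather than fixing them once per event.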
