Semantic Concept Discovery for Large-Scale Zero-Shot Event Detection

We focus on detecting complex events in unconstrained Internet videos. Whereas most existing work relies on abundant labeled training data, we consider the more difficult zero-shot setting, in which no training data is supplied. We first pre-train a number of concept classifiers using data from other sources. We then evaluate the semantic correlation of each concept with the event of interest. After further refinement that accounts for prediction inaccuracy and discriminative power, we apply the discovered concept classifiers to all test videos and obtain multiple score vectors. These distinct score vectors are converted into pairwise comparison matrices, and the nuclear norm rank aggregation framework is adopted to seek consensus among them. To address the challenging optimization formulation, we propose an efficient, highly scalable algorithm that is an order of magnitude faster than existing alternatives. Experiments on recent TRECVID datasets verify the superiority of the proposed approach.
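The aggregation step can be illustrated with a minimal sketch. It assumes each concept classifier yields one score per test video and that a pairwise comparison matrix holds score differences between videos; the function names and toy scores are hypothetical, and plain averaging stands in for the nuclear norm formulation, which is not reproduced here.

```python
import numpy as np

def pairwise_matrix(scores):
    """Skew-symmetric comparison matrix with entry (i, j) = s_i - s_j.
    Assumed form; the abstract does not spell out the construction."""
    s = np.asarray(scores, dtype=float)
    return s[:, None] - s[None, :]

def aggregate_by_averaging(score_vectors):
    """Naive consensus: average the comparison matrices, then read a
    score per video from the row means. This is a simple baseline,
    not the nuclear norm rank aggregation named in the abstract."""
    mats = [pairwise_matrix(s) for s in score_vectors]
    mean_mat = np.mean(mats, axis=0)
    return mean_mat.mean(axis=1)

# Toy data: three concept classifiers scoring three test videos.
score_vectors = [
    [0.9, 0.2, 0.5],
    [0.8, 0.1, 0.6],
    [0.7, 0.3, 0.4],
]
consensus = aggregate_by_averaging(score_vectors)
ranking = np.argsort(-consensus)  # video indices, best first
print(ranking)
```

Working on comparison matrices rather than raw scores makes the aggregate depend only on score differences between videos, so a constant offset in any single classifier's scores does not affect the consensus. A nuclear norm approach would additionally look for a low-rank matrix consistent with the noisy comparisons.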
