Unified Embedding and Metric Learning for Zero-Exemplar Event Detection

Event detection in unconstrained videos is cast as content-based video retrieval across two modalities, textual and visual: given a text description of a novel event, the goal is to rank related videos accordingly. The task is zero-exemplar: no video examples of the novel event are provided. Related work trains banks of concept detectors on external data sources; these detectors predict confidence scores for test videos, which are then ranked and retrieved. In contrast, we learn a joint space in which the visual and textual representations are embedded. This space casts a novel event as a probability distribution over pre-defined events, and it learns to measure the distance between an event and its related videos. Our model is trained end-to-end on the publicly available EventNet dataset. When applied to the TRECVID Multimedia Event Detection dataset, it outperforms the state-of-the-art by a considerable margin.
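The retrieval setup above can be sketched in a few lines: both modalities are projected into one joint space, and test videos are ranked by their similarity to the embedded event description. This is a minimal illustrative sketch, not the paper's model — the projection matrices here are random stand-ins for the learned embeddings, and all dimensions and names (`W_text`, `W_video`, `rank_videos`) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes: text features (e.g. a document embedding) and
# video features (e.g. pooled CNN activations), mapped into one joint space.
D_TEXT, D_VIDEO, D_JOINT = 300, 512, 128

# In the paper these projections are learned end-to-end; here they are
# random placeholders so the ranking mechanics can be shown.
W_text = rng.standard_normal((D_TEXT, D_JOINT)) / np.sqrt(D_TEXT)
W_video = rng.standard_normal((D_VIDEO, D_JOINT)) / np.sqrt(D_VIDEO)

def embed(x, W):
    """Project a feature vector into the joint space and L2-normalise it."""
    z = x @ W
    return z / np.linalg.norm(z)

def rank_videos(query_text, video_feats):
    """Rank candidate videos by cosine similarity to the event description."""
    q = embed(query_text, W_text)
    sims = np.array([embed(v, W_video) @ q for v in video_feats])
    return np.argsort(-sims), sims  # best-matching video first

# A novel-event text query and a handful of candidate test videos.
query = rng.standard_normal(D_TEXT)
videos = [rng.standard_normal(D_VIDEO) for _ in range(5)]
order, sims = rank_videos(query, videos)
```

In the actual system the projections would be trained with a metric-learning objective so that videos of an event land close to its textual description, which is what makes zero-exemplar queries rankable.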
