Bridging the Ultimate Semantic Gap: A Semantic Search Engine for Internet Videos

Semantic search in video is a novel and challenging problem in information and multimedia retrieval. Existing solutions are largely limited to text matching, in which query words are matched against user-generated textual metadata. This paper presents a state-of-the-art system for event search that requires neither textual metadata nor example videos. The system relies on substantial video content understanding and enables semantic search over a large collection of videos. Its novelty and practicality are demonstrated by the evaluation in NIST TRECVID 2014, where the proposed system achieved the best performance. We share our observations and lessons learned in building such a state-of-the-art system, which may help guide the design of future systems for semantic search in video.