Content-Based Video Search over 1 Million Videos with 1 Core in 1 Second

Many content-based video search (CBVS) systems have been proposed to analyze the rapidly increasing amount of user-generated video on the Internet. Though the accuracy of CBVS systems has drastically improved, these high-accuracy systems tend to be too inefficient for interactive search. Therefore, to strive for real-time web-scale CBVS, we perform a comprehensive study of the different components in a CBVS system to understand the trade-off between accuracy and speed for each component. Directions investigated include exploring different low-level and semantics-based features, testing different compression factors and approximations during video search, and understanding the time vs. accuracy trade-off of reranking. Extensive experiments on data sets consisting of more than 1,000 hours of video showed that, through a combination of effective features, highly compressed representations, and one iteration of reranking, our proposed system can achieve a 10,000-fold speedup while retaining 80% of the accuracy of a state-of-the-art CBVS system. We further performed search over 1 million videos and demonstrated that our system can complete the search in 0.975 seconds with a single core, which potentially opens the door to interactive web-scale CBVS for the general public.
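To make the combination of compressed representations and a single reranking pass more concrete, the following is a minimal sketch and not the system evaluated here. It assumes an 8-bit scalar quantizer as a stand-in for the compression step, a linear query model scored by dot product, and a simple pseudo-relevance-feedback update for reranking. All names and parameters (`quantize`, `dequantize`, `search`, `rerank_once`, `top_k`, `alpha`) are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def quantize(vec: Sequence[float], lo: float, hi: float, levels: int = 256) -> List[int]:
    """Map each float in [lo, hi] to an integer code in [0, levels - 1]."""
    scale = (levels - 1) / (hi - lo)
    return [min(levels - 1, max(0, round((v - lo) * scale))) for v in vec]


def dequantize(codes: Sequence[int], lo: float, hi: float, levels: int = 256) -> List[float]:
    """Approximately reconstruct the floats from their integer codes."""
    step = (hi - lo) / (levels - 1)
    return [lo + c * step for c in codes]


def search(weights: Sequence[float], database: Sequence[Sequence[int]],
           lo: float, hi: float) -> List[int]:
    """Rank all videos by the dot product between the query weights and
    the reconstructed (lossy) feature vectors, best first."""
    scores = []
    for codes in database:
        approx = dequantize(codes, lo, hi)
        scores.append(sum(w * x for w, x in zip(weights, approx)))
    return sorted(range(len(database)), key=lambda i: -scores[i])


def rerank_once(weights: Sequence[float], database: Sequence[Sequence[int]],
                lo: float, hi: float, top_k: int = 2, alpha: float = 0.5) -> List[int]:
    """One reranking iteration (assumed scheme): treat the current top_k
    results as pseudo-positives, shift the query weights toward their
    mean feature vector, and search again."""
    ranking = search(weights, database, lo, hi)
    top = [dequantize(database[i], lo, hi) for i in ranking[:top_k]]
    dim = len(weights)
    centroid = [sum(v[d] for v in top) / len(top) for d in range(dim)]
    new_weights = [w + alpha * c for w, c in zip(weights, centroid)]
    return search(new_weights, database, lo, hi)


# Toy usage: four "videos" with 2-dimensional features in [0, 1].
features = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.2]]
database = [quantize(f, 0.0, 1.0) for f in features]
query_weights = [1.0, 0.0]
initial_ranking = search(query_weights, database, 0.0, 1.0)
final_ranking = rerank_once(query_weights, database, 0.0, 1.0)
```

In this sketch each feature dimension is stored in one byte instead of a floating-point value, and the reranking pass costs one additional scan of the database, which is the kind of time vs. accuracy trade-off the study examines.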
