Scalable Mobile Video Retrieval with Sparse Projection Learning and Pseudo Label Mining

Retrieving relevant videos from a large corpus on mobile devices is a vital challenge. This article addresses two key issues for mobile search on user-generated videos. The first is the lack of good relevance measurement for learning semantically rich representations, due to the unconstrained nature of online videos. The second is the limited resources on mobile devices, stringent bandwidth, and delay requirement between the device and video server. The authors propose a knowledge-embedded sparse projection learning approach. To alleviate the need for expensive annotation in hash learning, they investigate varying approaches for pseudo label mining, where explicit semantic analysis leverages Wikipedia. In addition, they propose a novel sparse projection method to address the efficiency challenge by learning a discriminative compact representation that drastically reduces transmission costs. With less than 10 percent nonzero elements in the projection matrix, it also reduces computational and storage costs. The experimental results on 100,000 videos show that the proposed algorithm yields performance competitive with the prior state-of-the-art hashing methods, which are not applicable for mobiles and solely rely on costly manual annotations. The average query time for 100,000 videos was only 0.592 seconds.

[1]  Gerhard Weikum,et al.  YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia: Extended Abstract , 2013, IJCAI.

[2]  Kuo-Wei Chang,et al.  Evaluating Gaussian Like Image Representations over Local Features , 2012, 2012 IEEE International Conference on Multimedia and Expo.

[3]  John R. Kender,et al.  Visual memes in social media: tracking real-world news in YouTube videos , 2011, ACM Multimedia.

[4]  Tat-Seng Chua,et al.  Exploring large scale data for multimedia QA: an initial study , 2010, CIVR '10.

[5]  Shih-Fu Chang,et al.  Mobile product search with Bag of Hash Bits and boundary reranking , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Mehran Sahami,et al.  A web-based kernel function for measuring the similarity of short text snippets , 2006, WWW '06.

[7]  Evgeniy Gabrilovich,et al.  Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis , 2007, IJCAI.

[8]  Bernd Girod,et al.  Mobile Visual Search , 2011, IEEE Signal Processing Magazine.

[9]  Cordelia Schmid,et al.  Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[10]  Bernd Girod,et al.  Dynamic selection of a feature-rich query frame for mobile video retrieval , 2010, 2010 IEEE International Conference on Image Processing.

[11]  Catherine Havasi,et al.  ConceptNet 3 : a Flexible , Multilingual Semantic Network for Common Sense Knowledge , 2007 .

[12]  Zi Huang,et al.  Multiple feature hashing for real-time large scale near-duplicate video retrieval , 2011, ACM Multimedia.

[13]  Dimitris Achlioptas,et al.  Database-friendly random projections: Johnson-Lindenstrauss with binary coins , 2003, J. Comput. Syst. Sci..

[14]  Shih-Fu Chang,et al.  Sequential Projection Learning for Hashing with Compact Codes , 2010, ICML.

[15]  Yurii Nesterov,et al.  Generalized Power Method for Sparse Principal Component Analysis , 2008, J. Mach. Learn. Res..