Listen, look, and gotcha: instant video search with mobile phones by layered audio-video indexing

Mobile video is quickly becoming a mass consumer phenomenon. More and more people are using their smartphones to search and browse video content while on the move. In this paper, we have developed an innovative instant mobile video search system through which users can discover videos by simply pointing their phones at a screen to capture a very few seconds of what they are watching. The system is able to index large-scale video data using a new layered audio-video indexing approach in the cloud, as well as extract light-weight joint audio-video signatures in real time and perform progressive search on mobile devices. Unlike most existing mobile video search applications that simply send the original video query to the cloud, the proposed mobile system is one of the first attempts at instant and progressive video search leveraging the light-weight computing capacity of mobile devices. The system is characterized by four unique properties: 1) a joint audio-video signature to deal with the large aural and visual variances associated with the query video captured by the mobile phone, 2) layered audio-video indexing to holistically exploit the complementary nature of audio and video signals, 3) light-weight fingerprinting to comply with mobile processing capacity, and 4) a progressive query process to significantly reduce computational costs and improve the user experience---the search process can stop anytime once a confident result is achieved. We have collected 1,400 query videos captured by 25 mobile users from a dataset of 600 hours of video. The experiments show that our system outperforms state-of-the-art methods by achieving 90.79% precision when the query video is less than 10 seconds and 70.07% even when the query video is less than 5 seconds.

[1]  David J. Fleet,et al.  Minimal Loss Hashing for Compact Binary Codes , 2011, ICML.

[2]  David J. Fleet,et al.  Fast search in Hamming space with multi-index hashing , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Tao Mei,et al.  Scalable clip-based near-duplicate video detection with ordinal measure , 2010, CIVR '10.

[4]  Bernd Girod,et al.  Mobile Visual Search , 2011, IEEE Signal Processing Magazine.

[5]  Bernd Girod,et al.  CHoG: Compressed histogram of gradients A low bit-rate feature descriptor , 2009, CVPR.

[6]  Wen Gao,et al.  Towards low bit rate mobile visual search with multiple-channel coding , 2011, ACM Multimedia.

[7]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Fei Wang,et al.  Real-time large scale near-duplicate web video retrieval , 2010, ACM Multimedia.

[9]  David G. Lowe,et al.  Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration , 2009, VISAPP.

[10]  Avery Wang,et al.  An Industrial Strength Audio Search Algorithm , 2003, ISMIR.

[11]  Xin Yang,et al.  Accelerating SURF detector on mobile devices , 2012, ACM Multimedia.

[12]  Antonio Torralba,et al.  Spectral Hashing , 2008, NIPS.

[13]  Zhe Wang,et al.  Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search , 2007, VLDB.

[14]  Chong-Wah Ngo,et al.  Coherent bag-of audio words model for efficient large-scale video copy detection , 2010, CIVR '10.

[15]  Tao Mei,et al.  Finding perfect rendezvous on the go: accurate mobile visual localization and its applications to routing , 2012, ACM Multimedia.

[16]  Luc Van Gool,et al.  Speeded-Up Robust Features (SURF) , 2008, Comput. Vis. Image Underst..

[17]  Tao Mei,et al.  Local visual words coding for low bit rate mobile visual search , 2012, ACM Multimedia.

[18]  Winston H. Hsu,et al.  Sketch-based image retrieval on mobile devices using compact hash bits , 2012, ACM Multimedia.

[19]  Zi Huang,et al.  Effective and Efficient Query Processing for Video Subsequence Identification , 2009, IEEE Transactions on Knowledge and Data Engineering.

[20]  Bernd Girod,et al.  Dynamic selection of a feature-rich query frame for mobile video retrieval , 2010, 2010 IEEE International Conference on Image Processing.

[21]  D. Lowe,et al.  Fast Matching of Binary Features , 2012, 2012 Ninth Conference on Computer and Robot Vision.

[22]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[23]  Winston H. Hsu,et al.  Scalable mobile video question-answering system with locally aggregated descriptors and random projection , 2011, MM '11.

[24]  Shih-Fu Chang,et al.  Mobile product search with Bag of Hash Bits and boundary reranking , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Hung-Khoon Tan,et al.  Real-Time Near-Duplicate Elimination for Web Video Search With Content and Context , 2009, IEEE Transactions on Multimedia.

[26]  Max Mühlhäuser,et al.  Mobile interaction techniques for interrelated videos , 2010, CHI EA '10.