Fast object instance search in videos from one example

We present an efficient approach to search for and locate all occurrences of a specific object in large video volumes, given a single query example. The locations of object occurrences are returned as spatio-temporal trajectories in the 3D video volume. Although object instance search in image datasets has been studied extensively, existing methods locate the object independently in each image and therefore do not preserve spatio-temporal consistency across consecutive video frames. As our experiments show, this leads to sub-optimal performance when such methods are applied directly to videos. We instead propose to locate the object jointly across video frames using spatio-temporal search. The efficiency and effectiveness of the proposed approach are demonstrated on a consumer video dataset consisting of crawled YouTube videos and mobile-captured consumer clips. Our method significantly improves localized search accuracy over the baseline, which treats each frame independently. Moreover, it finds the top 100 object trajectories in the 5.5-hour dataset within 30 seconds.
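To make the contrast between per-frame and joint localization concrete, the following is a minimal sketch of linking per-frame candidate boxes into a single best-scoring trajectory with dynamic programming. It is an illustration only, not the algorithm of this paper, which the abstract does not specify. The names `best_trajectory`, `iou`, and `min_iou`, the box format, and the assumption that each candidate carries a matching score that may be negative (for example, a similarity minus a threshold) are all hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format
Candidate = Tuple[Box, float]            # (box, score); score may be negative


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def best_trajectory(frames: List[List[Candidate]], min_iou: float = 0.3):
    """Return (total_score, [(frame_index, box), ...]) for the best path.

    A path may start and end at any frame; boxes in consecutive frames
    are linked only if they overlap by at least min_iou (an assumed rule).
    """
    best_score = float("-inf")
    best_end: Optional[Tuple[int, int]] = None  # (frame, candidate index)
    prev_acc: List[float] = []                  # best path score ending at each previous box
    back: List[List[Optional[int]]] = []        # back[t][i] = predecessor index in frame t-1
    for t, cands in enumerate(frames):
        acc: List[float] = []
        ptr: List[Optional[int]] = []
        for i, (box, score) in enumerate(cands):
            best_prev, best_j = 0.0, None       # 0.0 means "start a new path here"
            if t > 0:
                for j, (pbox, _) in enumerate(frames[t - 1]):
                    if prev_acc[j] > best_prev and iou(pbox, box) >= min_iou:
                        best_prev, best_j = prev_acc[j], j
            acc.append(score + best_prev)
            ptr.append(best_j)
            if acc[-1] > best_score:
                best_score, best_end = acc[-1], (t, i)
        prev_acc = acc
        back.append(ptr)
    # Walk the back-pointers from the best end point to recover the path.
    path = []
    if best_end is not None:
        t, i = best_end
        while i is not None:
            path.append((t, frames[t][i][0]))
            i = back[t][i]
            t -= 1
    path.reverse()
    return best_score, path
```

Because a weakly matching box can still be included when it bridges two strong matches, a linked path can outscore any selection made frame by frame, which is the intuition behind searching jointly across frames. Returning the top-k trajectories, as the abstract describes, would need an additional step (for example, removing the found path and searching again) that this sketch omits.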
