Object Instance Search in Videos via Spatio-Temporal Trajectory Discovery

Given a specific object as query, object instance search aims to not only retrieve the images or frames that contain the query, but also locate all its occurrences. In this work, we explore the use of spatio-temporal cues to improve the quality of object instance search from videos. To this end, we formulate this problem as the spatio-temporal trajectory search problem, where a trajectory is a sequence of bounding boxes that locate the object instance in each frame. The goal is to find the top- K trajectories that are likely to contain the target object. Despite the large number of trajectory candidates, we build on a recent spatio- temporal search algorithm for event detection to efficiently find the optimal spatio- temporal trajectories in large video volumes , with complexity linear to the video volume size. We solve the key bottleneck in applying this approach to object instance search by leveraging a randomized approach to enable fast scoring of any bounding boxes in the video volume. In addition , we present a new dataset for video object instance search. Experimental results on a 73-hour video dataset demonstrate that our approach improves the performance of video object instance search and localization over the state-of-the-art search and tracking methods.

[1]  Florent Perronnin,et al.  Large-scale image retrieval with compressed Fisher vectors , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[2]  Bernd Girod,et al.  Temporal aggregation for large-scale query-by-image video retrieval , 2015, 2015 IEEE International Conference on Image Processing (ICIP).

[3]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[4]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[5]  Andrew Zisserman,et al.  All About VLAD , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Ming Yang,et al.  Discovery of Collocation Patterns: from Visual Words to Visual Phrases , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Cordelia Schmid,et al.  Learning object class detectors from weakly annotated video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Duy-Dinh Le,et al.  National Institute of Informatics, Japan at TRECVID 2008 , 2008, TRECVID.

[9]  Bernd Girod,et al.  Stanford I2V: a news video dataset for query-by-image experiments , 2015, MMSys.

[10]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[11]  Jiebo Luo,et al.  Towards Scalable Summarization of Consumer Videos Via Sparse Dictionary Selection , 2012, IEEE Transactions on Multimedia.

[12]  David A. Forsyth,et al.  Video Event Detection: From Subvolume Localization to Spatiotemporal Path Search , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Shin'ichi Satoh,et al.  Efficient instance search from large video database via sparse filters in subspaces , 2013, 2013 IEEE International Conference on Image Processing.

[14]  Shin'ichi Satoh,et al.  Large vocabulary quantization for searching instances from videos , 2012, ICMR '12.

[15]  Andrew Zisserman,et al.  Efficient Visual Search of Videos Cast as Text Retrieval , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Gang Wang,et al.  Object instance search in videos , 2013, 2013 9th International Conference on Information, Communications & Signal Processing.

[17]  Yuning Jiang,et al.  Randomized visual phrases for object search , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Cordelia Schmid,et al.  Correlation-based burstiness for logo retrieval , 2012, ACM Multimedia.

[19]  Antonio Torralba,et al.  LabelMe video: Building a video database with human annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[20]  Shin'ichi Satoh,et al.  Multi-image aggregation for better visual object retrieval , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Arnold W. M. Smeulders,et al.  Locality in Generic Instance Search from One Example , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Thomas Deselaers,et al.  Measuring the Objectness of Image Windows , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Jiri Matas,et al.  Efficient representation of local geometry for large scale object retrieval , 2009, CVPR.

[24]  Gang Wang,et al.  Fast object instance search in videos from one example , 2015, 2015 IEEE International Conference on Image Processing (ICIP).

[25]  Ying Wu,et al.  Object retrieval and localization with spatially-constrained similarity measure and k-NN re-ranking , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Ba Tu Truong,et al.  Video abstraction: A systematic review and classification , 2007, TOMCCAP.

[27]  Cordelia Schmid,et al.  Aggregating Local Image Descriptors into Compact Codes , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Christoph H. Lampert,et al.  Beyond sliding windows: Object localization by efficient subwindow search , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[30]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[31]  Wen Gao,et al.  Compact Descriptors for Visual Search , 2014, IEEE MultiMedia.

[32]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[33]  Rui Caseiro,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence High-speed Tracking with Kernelized Correlation Filters , 2022 .

[34]  Christoph H. Lampert Detecting objects in large image collections and videos by efficient subimage retrieval , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[35]  Kunio Kashino,et al.  NTT Communication Science Laboratories and NII in TRECVID 2010 Instance Search Task , 2010, TRECVID.

[36]  Yuning Jiang,et al.  Randomized Spatial Context for Object Search , 2015, IEEE Transactions on Image Processing.

[37]  Huizhong Chen,et al.  Efficient video search using image queries , 2014, 2014 IEEE International Conference on Image Processing (ICIP).

[38]  Patrick Bouthemy,et al.  Action Localization with Tubelets from Motion , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Cordelia Schmid,et al.  Spatio-temporal Object Detection Proposals , 2014, ECCV.

[41]  Hervé Glotin,et al.  IRIM at TRECVID 2014: Semantic Indexing and Instance Search , 2014, TRECVID.

[42]  David G. Lowe,et al.  Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration , 2009, VISAPP.

[43]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  Michael Isard,et al.  Lost in quantization: Improving particular object retrieval in large scale image databases , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Junsong Yuan,et al.  Optimal spatio-temporal path discovery for video event detection , 2011, CVPR 2011.

[46]  Yuning Jiang,et al.  Interactive visual object search through mutual information maximization , 2010, ACM Multimedia.

[47]  Bernt Schiele,et al.  How good are detection proposals, really? , 2014, BMVC.

[48]  Xian-Sheng Hua,et al.  Online Multimedia Advertising: Techniques and Technologies , 2010 .