Searching visual instances with topology checking and context modeling

Instance Search (INS) is a realistic problem initiated by TRECVID, which is to retrieve all occurrences of the querying object, location, or person from a large video collection. It is a fundamental problem with many applications, and also a challenging problem different from the traditional concept or near-duplicate (ND) search, since the relevancy is defined at instance level. True responses could exhibit various visual variations, such as being small on the image with different background, or showing a non-homography spatial configuration. Based on the Bag-of-Words model, we propose two techniques tailored for Instance Search. Specifically, we explore the use of (1) an elastic spatial topology checking technique based on Delaunay Triangulation (DT), and (2) a practical background context modeling method by simulating the "stare" behavior of human eyes. With DT, we improve the quality of visual matching by accumulating evidence from local topology-preserving patches, significantly boosting the ranks of topology consistent results. On the other hand, we increase the information quantity for visual matching with the "stare" model, such that instances appearing in both similar and different background can be highly ranked as results. The proposed techniques are evaluated on the INS datasets of TRECVID, achieving large performance gain with small computation overhead, compared with several existing methods.

[1]  C. Schmid,et al.  On the burstiness of visual elements , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Chong-Wah Ngo,et al.  Practical elimination of near-duplicates from web video search , 2007, ACM Multimedia.

[3]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[4]  David Nistér,et al.  Scalable Recognition with a Vocabulary Tree , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[5]  Shin'ichi Satoh,et al.  Large vocabulary quantization for searching instances from videos , 2012, ICMR '12.

[6]  Cordelia Schmid,et al.  Product Quantization for Nearest Neighbor Search , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Chong-Wah Ngo,et al.  On the Annotation of Web Videos by Efficient Near-Duplicate Search , 2010, IEEE Transactions on Multimedia.

[8]  Michael Isard,et al.  Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[9]  Michael Isard,et al.  Bundling features for large scale partial-duplicate web image search , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Cordelia Schmid,et al.  Scale & Affine Invariant Interest Point Detectors , 2004, International Journal of Computer Vision.

[11]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[12]  Paul Over,et al.  Evaluation campaigns and TRECVid , 2006, MIR '06.

[13]  Jiri Matas,et al.  Large-Scale Discovery of Spatially Related Images , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[15]  Cordelia Schmid,et al.  Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search , 2008, ECCV.

[16]  Tsuhan Chen,et al.  Image retrieval with geometry-preserving visual phrases , 2011, CVPR 2011.

[17]  Michael Isard,et al.  General Theory , 1969 .

[18]  Chong-Wah Ngo,et al.  Snap-and-ask: answering multimodal question by naming visual instance , 2012, ACM Multimedia.

[19]  Andrew Zisserman,et al.  Efficient image retrieval for 3D structures , 2010, BMVC.

[20]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Dennis Koelma,et al.  The MediaMill TRECVID 2008 Semantic Video Search Engine , 2008, TRECVID.

[22]  Michael Isard,et al.  Lost in quantization: Improving particular object retrieval in large scale image databases , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.