Visual landmark recognition from Internet photo collections: A large-scale evaluation

Abstract The task of a visual landmark recognition system is to identify photographed buildings or objects in query photos and to provide the user with relevant information about them. With their increasing coverage of the world’s landmark buildings and objects, Internet photo collections are now being used as a source for building such systems in a fully automatic fashion. This process typically consists of three steps: clustering large amounts of images by the objects they depict; determining object names from user-provided tags; and building a robust, compact, and efficient recognition index. To date, however, there is little empirical information on how well current approaches for those steps perform in a large-scale open-set mining and recognition task. There is also little empirical information on how recognition performance varies for different types of landmark objects and where there is still potential for improvement. With this paper, we intend to fill these gaps. Using a dataset of 500k images from Paris, we analyze each component of the landmark recognition pipeline in order to answer the following questions: How many and what kinds of objects can be discovered automatically? How can we best use the resulting image clusters to recognize the object in a query? How can the object be efficiently represented in memory for recognition? How reliably can semantic information be extracted? And finally: What are the limiting factors in the resulting pipeline from query to semantics? We evaluate how different choices of methods and parameters for the individual pipeline steps affect overall system performance and examine their effects for different query categories such as buildings, paintings, or sculptures.
