BoCNF: efficient image matching with Bag of ConvNet features for scalable and robust visual place recognition

Recent advances in visual place recognition (VPR) have exploited ConvNet features to improve recognition accuracy under significant environmental and viewpoint changes. However, how to match images efficiently with high-dimensional ConvNet features remains an open problem. In this paper, we tackle the problem of matching efficiency using ConvNet features for VPR, where the task is to recognize a given place accurately and quickly in large-scale, challenging environments. The paper makes two contributions. First, we propose an efficient solution to VPR, based on the well-known bag-of-words (BoW) framework, to speed up image matching with ConvNet features. Second, to alleviate the problem of perceptual aliasing in BoW, we adopt a coarse-to-fine approach: in the coarse stage, we search for the top-K candidate images via BoW, and in the fine stage, we identify the best match among the candidates using a hash-based voting scheme. We conduct extensive experiments on six popular VPR datasets to validate the effectiveness of our method. Experimental results show that, in terms of recognition accuracy, our method is comparable to linear search and outperforms other methods such as FABMAP and SeqSLAM by a significant margin. In terms of efficiency, our method achieves a significant speed-up over linear search, with an average matching time as low as 23.5 ms per query on a dataset with 21K images.
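
The coarse-to-fine idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the vocabulary, the scoring function, or the hash function, so the sketch assumes a precomputed visual vocabulary, an inverted index with IDF-weighted scoring for the coarse stage, and random-hyperplane hashing as a stand-in for the hash-based voting in the fine stage. All function names, parameters, and the synthetic data in `demo` are hypothetical.

```python
import numpy as np
from collections import defaultdict


def quantize(features, vocab):
    """Assign each local feature (row) to the index of its nearest visual word."""
    dists = ((features[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def build_inverted_index(db_words):
    """Map each visual word to the set of database images that contain it."""
    index = defaultdict(set)
    for img_id, words in enumerate(db_words):
        for w in set(words.tolist()):
            index[w].add(img_id)
    return index


def coarse_top_k(query_words, index, n_images, k):
    """Coarse stage: score images through the inverted index, keep the top k."""
    scores = np.zeros(n_images)
    for w in set(query_words.tolist()):
        postings = index.get(w, ())
        if postings:
            # Assumed IDF-style weight: rare words count more than common ones.
            idf = np.log(1.0 + n_images / len(postings))
            for img_id in postings:
                scores[img_id] += idf
    return np.argsort(-scores)[:k]


def hash_codes(features, planes):
    """Binary code per feature from the signs of random-hyperplane projections."""
    bits = (features @ planes.T) > 0
    return {row.tobytes() for row in bits}


def fine_best_match(query_feats, candidates, db_feats, planes):
    """Fine stage: each candidate gets one vote per hash code shared with the query."""
    q_codes = hash_codes(query_feats, planes)
    votes = {}
    for img_id in candidates:
        img_id = int(img_id)
        votes[img_id] = len(q_codes & hash_codes(db_feats[img_id], planes))
    return max(votes, key=votes.get)


def demo():
    """Synthetic example: the query reuses the local features of database image 7."""
    rng = np.random.default_rng(0)
    dim, n_images, n_feats = 32, 20, 30
    vocab = rng.normal(size=(50, dim))    # hypothetical 50-word vocabulary
    planes = rng.normal(size=(16, dim))   # 16 random hyperplanes -> 16-bit codes
    db_feats = [rng.normal(size=(n_feats, dim)) for _ in range(n_images)]
    db_words = [quantize(f, vocab) for f in db_feats]
    index = build_inverted_index(db_words)

    query_feats = db_feats[7].copy()
    candidates = coarse_top_k(quantize(query_feats, vocab), index, n_images, k=5)
    return fine_best_match(query_feats, candidates, db_feats, planes)
```

The coarse stage only touches images that share at least one visual word with the query, which is where the speed-up over linear search would come from in this sketch; the fine stage then compares the query against K candidates rather than the whole database.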
