Hierarchical Metric Learning and Matching for 2D and 3D Geometric Correspondences

Interest point descriptors have fueled progress on almost every problem in computer vision. Recent advances in deep neural networks have enabled task-specific learned descriptors that outperform hand-crafted descriptors on many problems. We demonstrate that commonly used metric learning approaches do not optimally leverage the feature hierarchies learned in a Convolutional Neural Network (CNN), especially when applied to the task of geometric feature matching. While a metric loss applied to the deepest layer of a CNN, is often expected to yield ideal features irrespective of the task, in fact the growing receptive field as well as striding effects cause shallower features to be better at high precision matching tasks. We leverage this insight together with explicit supervision at multiple levels of the feature hierarchy for better regularization, to learn more effective descriptors in the context of geometric matching tasks. Further, we propose to use activation maps at different layers of a CNN, as an effective and principled replacement for the multi-resolution image pyramids often used for matching tasks. We propose concrete CNN architectures employing these ideas, and evaluate them on multiple datasets for 2D and 3D geometric matching as well as optical flow, demonstrating state-of-the-art results and generalization across datasets.

[1]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[2]  Stefan Leutenegger,et al.  Semantic Texture for Robust Dense Tracking , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[3]  Vladlen Koltun,et al.  Robust reconstruction of indoor scenes , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  K. Madhava Krishna,et al.  The Earth Ain't Flat: Monocular Reconstruction of Vehicles on Steep and Graded Roads from a Moving Camera , 2018, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[5]  Jiwen Lu,et al.  Learning Compact Binary Descriptors with Unsupervised Deep Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Michael J. Black,et al.  A Naturalistic Open Source Movie for Optical Flow Evaluation , 2012, ECCV.

[7]  Takeo Kanade,et al.  Optical Navigation by the Method of Differences , 1985, IJCAI.

[8]  Pushmeet Kohli,et al.  The Global Patch Collider , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[10]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[11]  Tony Lindeberg,et al.  Feature Detection with Automatic Scale Selection , 1998, International Journal of Computer Vision.

[12]  Ning Zhang,et al.  AutoScaler: Scale-Attention Networks for Visual Correspondence , 2016, BMVC.

[13]  Cordelia Schmid,et al.  DeepFlow: Large Displacement Optical Flow with Deep Matching , 2013, 2013 IEEE International Conference on Computer Vision.

[14]  Michael J. Black,et al.  Optical Flow with Semantic Segmentation and Localized Layers , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Yung-Yu Chuang,et al.  DeepCD: Learning Deep Complementary Descriptors for Patch Representations , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[16]  Adrien Bartoli,et al.  KAZE Features , 2012, ECCV.

[17]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[18]  Vladlen Koltun,et al.  Full Flow: Optical Flow Estimation By Global Optimization over Regular Grids , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Didier Stricker,et al.  Supplementary material of : CNN-based Patch Matching for Optical Flow with Thresholded Hinge Embedding Loss , 2017 .

[20]  Sen Wang,et al.  DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[21]  Min Bai,et al.  Exploiting Semantic Information and Deep Matching for Optical Flow , 2016, ECCV.

[22]  Christopher G. Harris,et al.  A Combined Corner and Edge Detector , 1988, Alvey Vision Conference.

[23]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Cordelia Schmid,et al.  EpicFlow: Edge-preserving interpolation of correspondences for optical flow , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[26]  Eric Brachmann,et al.  DSAC — Differentiable RANSAC for Camera Localization , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[28]  Zhuowen Tu,et al.  Deeply-Supervised Nets , 2014, AISTATS.

[29]  Ronan Collobert,et al.  Learning to Refine Object Segments , 2016, ECCV.

[30]  Roland Siegwart,et al.  BRISK: Binary Robust invariant scalable keypoints , 2011, 2011 International Conference on Computer Vision.

[31]  Michael S. Brown,et al.  SPM-BP: Sped-Up PatchMatch Belief Propagation for Continuous MRFs , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[32]  Steven M. Seitz,et al.  Photo tourism: exploring photo collections in 3D , 2006, ACM Trans. Graph..

[33]  Yann LeCun,et al.  Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches , 2015, J. Mach. Learn. Res..

[34]  Vincent Lepetit,et al.  LIFT: Learned Invariant Feature Transform , 2016, ECCV.

[35]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[38]  Lin Sun,et al.  Feedback Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Torsten Sattler,et al.  Comparative Evaluation of Hand-Crafted and Learned Local Features , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Trevor Darrell,et al.  Do Convnets Learn Correspondence? , 2014, NIPS.

[41]  Cordelia Schmid,et al.  SfM-Net: Learning of Structure and Motion from Video , 2017, ArXiv.

[42]  Daniel D. Lee,et al.  Online self-supervised monocular visual odometry for ground vehicles , 2015, 2015 IEEE International Conference on Robotics and Automation (ICRA).

[43]  Gregory D. Hager,et al.  Deep Supervision with Intermediate Concepts , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Nico Blodow,et al.  Fast Point Feature Histograms (FPFH) for 3D registration , 2009, 2009 IEEE International Conference on Robotics and Automation.

[45]  Jitendra Malik,et al.  Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Yann LeCun,et al.  Learning a similarity metric discriminatively, with application to face verification , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[47]  Christopher Hunt,et al.  Notes on the OpenSURF Library , 2009 .

[48]  Vincent Lepetit,et al.  BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[49]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Silvio Savarese,et al.  Universal Correspondence Network , 2016, NIPS.

[51]  Andrea Vedaldi,et al.  HPatches: A Benchmark and Evaluation of Handcrafted and Learned Local Descriptors , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Jitendra Malik,et al.  Hypercolumns for object segmentation and fine-grained localization , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Matthias Nießner,et al.  3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[55]  Shih-Fu Chang,et al.  Learning Spread-Out Local Feature Descriptors , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[56]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[57]  Bolei Zhou,et al.  Object Detectors Emerge in Deep Scene CNNs , 2014, ICLR.

[58]  Jason J. Corso,et al.  A Continuous Occlusion Model for Road Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[60]  Roberto Cipolla,et al.  PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[61]  Rachid Deriche,et al.  A Robust Technique for Matching two Uncalibrated Images Through the Recovery of the Unknown Epipolar Geometry , 1995, Artif. Intell..

[62]  Lior Wolf,et al.  PatchBatch: A Batch Augmented Loss for Optical Flow , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Vincent Lepetit,et al.  DAISY: An Efficient Dense Descriptor Applied to Wide-Baseline Stereo , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[64]  Gregory D. Hager,et al.  Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[66]  Andreas Geiger,et al.  Object scene flow for autonomous vehicles , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[67]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[68]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.