MatchNet: Unifying feature and metric learning for patch-based matching

Motivated by recent successes on learning feature representations and on learning feature comparison functions, we propose a unified approach to combining both for training a patch matching system. Our system, dubbed Match-Net, consists of a deep convolutional network that extracts features from patches and a network of three fully connected layers that computes a similarity between the extracted features. To ensure experimental repeatability, we train MatchNet on standard datasets and employ an input sampler to augment the training set with synthetic exemplar pairs that reduce overfitting. Once trained, we achieve better computational efficiency during matching by disassembling MatchNet and separately applying the feature computation and similarity networks in two sequential stages. We perform a comprehensive set of experiments on standard datasets to carefully study the contributions of each aspect of MatchNet, with direct comparisons to established methods. Our results confirm that our unified approach improves accuracy over previous state-of-the-art results on patch matching datasets, while reducing the storage requirement for descriptors. We make pre-trained MatchNet publicly available.

[1]  Matthew A. Brown,et al.  Automatic Panoramic Image Stitching using Invariant Features , 2007, International Journal of Computer Vision.

[2]  Vincent Lepetit,et al.  Boosting Binary Keypoint Descriptors , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Richard Szeliski,et al.  Modeling the World from Internet Photo Collections , 2008, International Journal of Computer Vision.

[4]  Yann LeCun,et al.  Learning a similarity metric discriminatively, with application to face verification , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[5]  Christopher Hunt,et al.  Notes on the OpenSURF Library , 2009 .

[6]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[7]  Yan Ke,et al.  PCA-SIFT: a more distinctive representation for local image descriptors , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[8]  Antonio Torralba,et al.  Spectral Hashing , 2008, NIPS.

[9]  Gary R. Bradski,et al.  ORB: An efficient alternative to SIFT or SURF , 2011, 2011 International Conference on Computer Vision.

[10]  Jeffrey Scott Vitter,et al.  Random sampling with a reservoir , 1985, TOMS.

[11]  Vincent Lepetit,et al.  Learning Image Descriptors with the Boosting-Trick , 2012, NIPS.

[12]  Dumitru Erhan,et al.  Scalable Object Detection Using Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[14]  Matthew A. Brown,et al.  Picking the best DAISY , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Gregory Shakhnarovich,et al.  Learning task-specific similarity , 2005 .

[16]  Gary R. Bradski,et al.  A codebook-free and annotation-free approach for fine-grained image categorization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Yann LeCun,et al.  Computing the stereo matching cost with a convolutional neural network , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Andrew Zisserman,et al.  Learning Local Feature Descriptors Using Convex Optimisation , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Trevor Darrell,et al.  Heavy-tailed Distances for Gradient Based Image Descriptors , 2011, NIPS.

[20]  Yang Song,et al.  Learning Fine-Grained Image Similarity with Deep Ranking , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Inderjit S. Dhillon,et al.  Metric and Kernel Learning Using a Linear Transformation , 2009, J. Mach. Learn. Res..

[22]  Pascal Fua,et al.  LDAHash: Improved Matching with Smaller Descriptors , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Cordelia Schmid,et al.  A Comparison of Affine Region Detectors , 2005, International Journal of Computer Vision.

[24]  Richard Szeliski,et al.  A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[25]  Cordelia Schmid,et al.  A performance evaluation of local descriptors , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Gang Hua,et al.  Discriminative Learning of Local Image Descriptors , 1990, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[29]  Geoffrey E. Hinton,et al.  Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure , 2007, AISTATS.

[30]  Christian Szegedy,et al.  DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Yann LeCun,et al.  Signature Verification Using A "Siamese" Time Delay Neural Network , 1993, Int. J. Pattern Recognit. Artif. Intell..

[32]  Ian D. Reid,et al.  Locally Planar Patch Features for Real-Time Structure from Motion , 2004, BMVC.

[33]  Jiri Matas,et al.  Robust wide-baseline stereo from maximally stable extremal regions , 2004, Image Vis. Comput..

[34]  Jan-Michael Frahm,et al.  Comparative Evaluation of Binary Features , 2012, ECCV.