AutoScaler: Scale-Attention Networks for Visual Correspondence

Finding visual correspondence between local features is key to many computer vision problems. Defining features over larger contextual scales usually makes them more discriminative, but it can also reduce their spatial accuracy. We propose AutoScaler, a scale-attention network that explicitly optimizes this trade-off in visual correspondence tasks. Our network consists of a weight-sharing feature network that computes multi-scale feature maps and an attention network that combines them optimally in scale space. This gives the network adaptive receptive field sizes across different scales of the input. The entire network is trained end-to-end in a siamese framework for visual correspondence tasks. Our method achieves favorable results compared to state-of-the-art methods on challenging optical flow and semantic matching benchmarks, including Sintel, KITTI and CUB-2011. We also show that our method generalizes to improve hand-crafted descriptors (e.g., DAISY) on general visual correspondence tasks. Finally, our attention network generates visually interpretable scale attention maps.
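
The combination step described above can be illustrated with a minimal sketch. It assumes, hypothetically, that the feature maps from all scales have already been resampled to a common resolution and that the attention network emits one logit per scale and pixel; the softmax-weighted sum and the names `softmax` and `fuse_scales` are illustrative assumptions, not the formulation given in the paper.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_scales(features, logits):
    """Combine per-scale feature maps with per-pixel scale attention.

    features: array of shape (S, H, W, C), one feature map per scale
              (assumed already resampled to a common H x W grid).
    logits:   array of shape (S, H, W), unnormalized attention scores
              (assumed output of a hypothetical attention network).
    Returns the fused (H, W, C) map and the (S, H, W) attention weights.
    """
    weights = softmax(logits, axis=0)            # weights sum to 1 over scales
    fused = (weights[..., None] * features).sum(axis=0)
    return fused, weights

# Toy inputs standing in for real network outputs.
rng = np.random.default_rng(0)
S, H, W, C = 3, 4, 5, 8
features = rng.normal(size=(S, H, W, C))
logits = rng.normal(size=(S, H, W))
fused, weights = fuse_scales(features, logits)
```

Under these assumptions, `weights[:, y, x]` plays the role of the scale attention at pixel `(y, x)`: a large weight on a coarse scale means that pixel's descriptor draws mostly on wide context, while a large weight on a fine scale favors spatial precision.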
