LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation

Cross-resolution image alignment is a key problem in multiscale gigapixel photography, which requires to estimate homography matrix using images with large resolution gap. Existing deep homography methods concatenate the input images or features, neglecting the explicit formulation of correspondences between them, which leads to degraded accuracy in cross-resolution challenges. In this paper, we consider the cross-resolution homography estimation as a multimodal problem, and propose a local transformer network embedded within a multiscale structure to explicitly learn correspondences between the multimodal inputs, namely, input images with different resolutions. The proposed local transformer adopts a local attention map specifically for each position in the feature. By combining the local transformer with the multiscale structure, the network is able to capture long-short range correspondences efficiently and accurately. Experiments on both the MS-COCO dataset and the real-captured cross-resolution dataset show that the proposed network outperforms existing state-of-the-art feature-based and deep-learning-based homography estimation methods, and is able to accurately align images under 10× resolution gap. * Equal contribution

[1]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[2]  Sébastien Marcel,et al.  Torchvision the machine-vision package of torch , 2010, ACM Multimedia.

[3]  Ronald A. Rensink The Dynamic Representation of Scenes , 2000 .

[4]  Georgios Paraskevopoulos,et al.  Multimodal and Multiresolution Speech Recognition with Transformers , 2020, ACL.

[5]  Christof Koch,et al.  A Model of Saliency-Based Visual Attention for Rapid Scene Analysis , 2009 .

[6]  KochChristof,et al.  A Model of Saliency-Based Visual Attention for Rapid Scene Analysis , 1998 .

[7]  Baining Guo,et al.  Learning Texture Transformer Network for Image Super-Resolution , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Qilong Wang,et al.  ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[10]  Qionghai Dai,et al.  Multiscale-VR: Multiscale Gigapixel 3D Panoramic Videography for Virtual Reality , 2020, 2020 IEEE International Conference on Computational Photography (ICCP).

[11]  Gary R. Bradski,et al.  ORB: An efficient alternative to SIFT or SURF , 2011, 2011 International Conference on Computer Vision.

[12]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[13]  Ashok Veeraraghavan,et al.  Improving resolution and depth-of-field of light field cameras using a hybrid imaging system , 2014, 2014 IEEE International Conference on Computational Photography (ICCP).

[14]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[15]  In-So Kweon,et al.  CBAM: Convolutional Block Attention Module , 2018, ECCV.

[16]  Feng Liu,et al.  Deep Homography Estimation for Dynamic Scenes , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Long Quan,et al.  ASLFeat: Learning Local Features of Accurate Shape and Localization , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Kun Li,et al.  Cross-MPI: Cross-scale Stereo for Image Super-Resolution using Multiplane Images , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Lei Zhou,et al.  ContextDesc: Local Descriptor Augmentation With Cross-Modality Context , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Tong Zhang,et al.  Modeling Localness for Self-Attention Networks , 2018, EMNLP.

[21]  Jiri Matas,et al.  Repeatability Is Not Enough: Learning Affine Regions via Discriminability , 2017, ECCV.

[22]  Tomasz Malisiewicz,et al.  Deep Image Homography Estimation , 2016, ArXiv.

[23]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[24]  Lu Fang,et al.  Cross-Scale Reference-Based Light Field Super-Resolution , 2018, IEEE Transactions on Computational Imaging.

[25]  Lu Fang,et al.  CrossNet: An End-to-end Reference-based Super Resolution Network using Cross-scale Warping , 2018, ECCV.

[26]  Byoung-Tak Zhang,et al.  Bilinear Attention Networks , 2018, NeurIPS.

[27]  M. Corbetta,et al.  Control of goal-directed and stimulus-driven attention in the brain , 2002, Nature Reviews Neuroscience.

[28]  David J. Brady,et al.  Multiscale gigapixel photography , 2012, Nature.

[29]  Zhou Yu,et al.  Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[30]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[31]  John R. Hershey,et al.  Attention-Based Multimodal Fusion for Video Description , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Jiri Matas,et al.  MAGSAC: Marginalizing Sample Consensus , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Sridha Sridharan,et al.  Rethinking Planar Homography Estimation Using Perspective Fields , 2018, ACCV.

[34]  Matthieu Cord,et al.  MUTAN: Multimodal Tucker Fusion for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[35]  Lei Zhou,et al.  GeoDesc: Learning Local Descriptors by Integrating Geometry Constraints , 2018, ECCV.

[36]  Yu Zhang,et al.  Conformer: Convolution-augmented Transformer for Speech Recognition , 2020, INTERSPEECH.

[37]  Lu Fang,et al.  Learning Cross-scale Correspondence and Patch-based Synthesis for Reference-based Super-Resolution , 2017, BMVC.

[38]  Vijay Kumar,et al.  Unsupervised Deep Homography: A Fast and Robust Homography Estimation Model , 2017, IEEE Robotics and Automation Letters.

[39]  Christopher D. Manning,et al.  Effective Approaches to Attention-based Neural Machine Translation , 2015, EMNLP.

[40]  Robert C. Bolles,et al.  Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography , 1981, CACM.

[41]  Jan Kautz,et al.  PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42]  Edward Y. Chang,et al.  CLKN: Cascaded Lucas-Kanade Networks for Image Alignment , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Jue Wang,et al.  Content-Aware Unsupervised Deep Homography Estimation , 2020, ECCV.

[44]  Pascal Fua,et al.  LF-Net: Learning Local Features from Images , 2018, NeurIPS.

[45]  Qionghai Dai,et al.  Multiscale gigapixel video: A cross resolution image matching and warping approach , 2017, 2017 IEEE International Conference on Computational Photography (ICCP).

[46]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47]  Xiaogang Wang,et al.  Residual Attention Network for Image Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Hairong Qi,et al.  Image Super-Resolution by Neural Texture Transfer , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  A. Schwing,et al.  Spatially Aware Multimodal Transformers for TextVQA , 2020, ECCV.

[50]  Zhou Yu,et al.  Multimodal Transformer With Multi-View Visual Representation for Image Captioning , 2019, IEEE Transactions on Circuits and Systems for Video Technology.

[51]  Christopher Hunt,et al.  Notes on the OpenSURF Library , 2009 .

[52]  David G. Lowe,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.