PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion

The Transformer architecture has developed rapidly in recent years, outperforming CNN architectures in many computer vision tasks; the Vision Transformer (ViT) for image classification is one example. However, existing visual transformer models aim to extract semantic information for high-level tasks such as classification and detection. This distorts the spatial resolution of the input image and leaves the models without the capacity to reconstruct the input or generate high-resolution images. In this paper, therefore, we propose a Patch Pyramid Transformer (PPT) to address these issues. Specifically, we first design a Patch Transformer that transforms the image into a sequence of patches, where transformer encoding is performed for each patch to extract local representations. In addition, we construct a Pyramid Transformer to extract non-local information from the entire image. After obtaining a set of multi-scale, multi-dimensional, and multi-angle features of the original image, we design an image reconstruction network to ensure that the features can be reconstructed into the input image. To validate its effectiveness, we apply the proposed Patch Pyramid Transformer to the image fusion task. The experimental results demonstrate its superior performance compared with state-of-the-art fusion approaches, achieving the best results on several evaluation indicators. The underlying capacity of the PPT network is reflected in its general-purpose feature extraction and image reconstruction, which can be applied directly to different image fusion tasks without redesigning or retraining the network.
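The patch-sequence idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' network: it performs no transformer encoding, and the function names (`image_to_patches`, `patches_to_image`, `patch_pyramid`) and the patch sizes are hypothetical choices made for this example. It only shows that splitting an image into non-overlapping patches at several scales keeps every pixel, so the input can be reassembled exactly. That is the property a reconstruction-oriented model needs and that resolution-reducing classification backbones give up.

```python
import numpy as np


def image_to_patches(image, patch):
    """Split an (H, W) image into non-overlapping patch x patch tiles.

    Returns an array of shape (num_patches, patch * patch): one flattened
    tile per row, ordered row-major over the patch grid.
    """
    h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # Axes after reshape: (grid_row, in_patch_row, grid_col, in_patch_col).
    tiles = image.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    return tiles.reshape(gh * gw, patch * patch)


def patches_to_image(patches, shape, patch):
    """Inverse of image_to_patches: reassemble the (H, W) image."""
    h, w = shape
    gh, gw = h // patch, w // patch
    tiles = patches.reshape(gh, gw, patch, patch).transpose(0, 2, 1, 3)
    return tiles.reshape(h, w)


def patch_pyramid(image, patch_sizes=(4, 8, 16)):
    """Patch sequences at several scales (patch sizes are an assumption)."""
    return {p: image_to_patches(image, p) for p in patch_sizes}


# Toy 32 x 32 "image" whose pixel values equal their row-major index.
image = np.arange(32 * 32, dtype=np.float32).reshape(32, 32)
pyramid = patch_pyramid(image)

for p, seq in pyramid.items():
    restored = patches_to_image(seq, image.shape, p)
    print(p, seq.shape, np.array_equal(restored, image))
```

In a real model, each row of a patch sequence would be passed through a learned encoder, and a decoder would map the resulting features back to pixels; the exact round trip shown here is only the trivial case in which the features are the raw pixels.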
