Paper Information - MatteFormer: Transformer-Based Image Matting via Prior-Tokens

In this paper, we propose a transformer-based image matting model called MatteFormer, which takes full advantage of trimap information in the transformer block. Our method first introduces a prior-token, which is a global representation of each trimap region (i.e., foreground, background, and unknown). These prior-tokens are used as global priors and participate in the self-attention mechanism of each block. Each stage of the encoder is composed of PAST (Prior-Attentive Swin Transformer) blocks, which are based on the Swin Transformer block but differ in two aspects: 1) each has a PA-WSA (Prior-Attentive Window Self-Attention) layer, which performs self-attention not only with spatial-tokens but also with prior-tokens; 2) each has a prior-memory, which accumulates the prior-tokens from previous blocks and passes them on to the next block. We evaluate MatteFormer on two commonly used image matting datasets: Composition-1k and Distinctions-646. Experimental results show that the proposed method achieves state-of-the-art performance by a large margin. Our code is available at https://github.com/webtoon/matteformer.
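
The three ideas in the abstract (prior-tokens, attention over spatial-tokens plus prior-tokens, and a prior-memory that grows across blocks) can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes that a prior-token is the plain average of the spatial tokens in a trimap region, uses a single attention head without learned projections, and omits window shifting and relative position bias. The function names `prior_tokens` and `prior_attentive_attention` and the trimap encoding (0 = background, 1 = unknown, 2 = foreground) are hypothetical choices made for illustration.

```python
import numpy as np


def prior_tokens(features, trimap):
    """Average the spatial tokens of each trimap region into one token.

    features: (N, C) array of spatial tokens.
    trimap:   (N,) int array; 0 = background, 1 = unknown, 2 = foreground
              (assumed encoding, not taken from the paper).
    Returns a (3, C) array; an empty region yields a zero token.
    """
    tokens = np.zeros((3, features.shape[1]), dtype=float)
    for region in range(3):
        mask = trimap == region
        if mask.any():
            tokens[region] = features[mask].mean(axis=0)
    return tokens


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def prior_attentive_attention(window_tokens, priors):
    """Single-head attention: queries come from the window tokens,
    keys and values are the window tokens plus the prior-tokens.
    Learned Q/K/V projections are omitted (identity) for brevity.
    """
    kv = np.concatenate([window_tokens, priors], axis=0)    # (M + P, C)
    scale = window_tokens.shape[1] ** -0.5
    attn = softmax(window_tokens @ kv.T * scale, axis=-1)   # (M, M + P)
    return attn @ kv, attn


# Toy usage: 16 spatial tokens with 8 channels, one window of 4 tokens.
rng = np.random.default_rng(0)
features = rng.standard_normal((16, 8))
trimap = np.arange(16) % 3          # toy labels covering all three regions

memory = []                         # prior-memory, accumulated across blocks
for block in range(2):
    memory.append(prior_tokens(features, trimap))
    priors = np.concatenate(memory, axis=0)     # 3 tokens per block so far
    window = features[:4]
    out, attn = prior_attentive_attention(window, priors)
    print(block, out.shape, attn.shape)
```

In this sketch each window token attends to its 4 window neighbours plus every prior-token stored so far, so the attention width grows from 7 in the first block to 10 in the second while the output keeps the window's shape.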
