MatteFormer: Transformer-Based Image Matting via Prior-Tokens

In this paper, we propose a transformer-based image matting model called MatteFormer, which takes full advantage of trimap information in the transformer block. Our method first introduces a prior-token which is a global representation of each trimap region (e.g. foreground, background and unknown). These prior-tokens are used as global priors and participate in the self-attention mechanism of each block. Each stage of the encoder is composed of PAST (Prior-Attentive Swin Transformer) block, which is based on the Swin Transformer block, but differs in a couple of aspects: 1) It has PA-WSA (Prior-Attentive Window Self-Attention) layer, performing self-attention not only with spatial-tokens but also with prior-tokens. 2) It has prior-memory which saves prior-tokens accumulatively from the previous blocks and transfers them to the next block. We evaluate our MatteFormer on the commonly used image matting datasets: Composition-Ik and Distinctions-646. Experiment results show that our proposed method achieves state-of-the-art performance with a large margin. Our codes are available at https://github.com/webtoon/matteformer.

[1]  Xiangyu Zhang,et al.  Anchor DETR: Query Design for Transformer-Based Detector , 2022, AAAI.

[2]  Luc Van Gool,et al.  SwinIR: Image Restoration Using Swin Transformer , 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).

[3]  Nenghai Yu,et al.  CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Lu Yuan,et al.  Focal Self-attention for Local-Global Interactions in Vision Transformers , 2021, ArXiv.

[5]  A. Piergiovanni,et al.  TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? , 2021, ArXiv.

[6]  Anima Anandkumar,et al.  SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers , 2021, NeurIPS.

[7]  Chi-Keung Tang,et al.  Semantic Image Matting , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Luc Van Gool,et al.  LocalViT: Bringing Locality to Vision Transformers , 2021, ArXiv.

[9]  Lu Yuan,et al.  Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[10]  N. Codella,et al.  CvT: Introducing Convolutions to Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[11]  Jing Liao,et al.  High-Fidelity Pluralistic Image Completion with Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[12]  Jonathon Shlens,et al.  Scaling Local Self-Attention for Parameter Efficient Visual Backbones , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Enhua Wu,et al.  Transformer in Transformer , 2021, NeurIPS.

[14]  Xiang Li,et al.  Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[15]  Francis E. H. Tay,et al.  Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[16]  Xudong Jiang,et al.  Towards Enhancing Fine-grained Details for Image Matting , 2021, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[17]  Tao Xiang,et al.  Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Matthieu Cord,et al.  Training data-efficient image transformers & distillation through attention , 2020, ICML.

[19]  Eric Tzeng,et al.  Toward Transformer-Based Object Detection , 2020, ArXiv.

[20]  Ira Kemelmacher-Shlizerman,et al.  Real-Time High-Resolution Background Matting , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Ning Xu,et al.  Mask Guided Matting via Progressive Refinement Network , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Wen Gao,et al.  Pre-Trained Image Processing Transformer , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Ning Xu,et al.  High-Resolution Deep Image Matting , 2020, AAAI.

[24]  Kurt Keutzer,et al.  Visual Transformers: Token-based Image Representation and Processing for Computer Vision , 2020, ArXiv.

[25]  Yu Qiao,et al.  Attention-Guided Hierarchical Structure Aggregation for Image Matting , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[27]  Ira Kemelmacher-Shlizerman,et al.  Background Matting: The World Is Your Green Screen , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Hongtao Lu,et al.  Natural Image Matting via Guided Contextual Attention , 2020, AAAI.

[29]  Feng Liu,et al.  Context-Aware Image Matting for Simultaneous Foreground and Alpha Estimation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Jiangyu Liu,et al.  Disentangled Image Matting , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Hao Lu,et al.  Indices Matter: Learning to Index for Deep Image Matting , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Hujun Bao,et al.  A Late Fusion CNN for Digital Matting , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Jingwei Tang,et al.  Learning-Based Sampling for Natural Image Matting , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Aljoscha Smolic,et al.  AlphaGAN: Generative adversarial networks for natural image matting , 2018, BMVC.

[35]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[36]  Ning Xu,et al.  Deep Image Matting , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[40]  Deepu Rajan,et al.  Improving Image Matting Using Comprehensive Sampling Sets , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Chi-Keung Tang,et al.  KNN Matting , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Ying Wu,et al.  Nonlocal matting , 2011, CVPR 2011.

[43]  Jian Sun,et al.  A global sampling method for alpha matting , 2011, CVPR 2011.

[44]  Jian Sun,et al.  Fast matting using large kernel matting Laplacian matrices , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[45]  Manuel Menezes de Oliveira Neto,et al.  Shared Sampling for Real‐Time Alpha Matting , 2010, Comput. Graph. Forum.

[46]  Yuanjie Zheng,et al.  Learning based digital matting , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[47]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  Michael F. Cohen,et al.  Optimized Color Sampling for Robust Matting , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Dani Lischinski,et al.  Spectral Matting , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[50]  Dani Lischinski,et al.  A Closed-Form Solution to Natural Image Matting , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51]  David Salesin,et al.  A Bayesian approach to digital matting , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[52]  Zhisheng Lu,et al.  Efficient Transformer for Single Image Super-Resolution , 2021, ArXiv.

[53]  Bo Zhang,et al.  Do We Really Need Explicit Position Encodings for Vision Transformers? , 2021, ArXiv.

[54]  Stephen Lin,et al.  Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[55]  Jiake Xie,et al.  Supplementary Material for the Paper: Tripartite Information Mining and Integration for Image Matting , 2021 .

[56]  L. Gool,et al.  Transformer in Convolutional Neural Networks , 2021, ArXiv.

[57]  Jiaya Jia,et al.  To appear in , 2004 .

[58]  Christopher K. I. Williams,et al.  International Journal of Computer Vision manuscript No. (will be inserted by the editor) The PASCAL Visual Object Classes (VOC) Challenge , 2022 .