Cloth Interactive Transformer for Virtual Try-On

2D image-based virtual try-on has attracted increased attention from the multimedia and computer vision communities. However, most of the existing image-based virtual try-on methods directly put both person and the inshop clothing representations together, without considering the mutual correlation between them. What is more, the long-range information, which is crucial for generating globally consistent results, is also hard to be established via the regular convolution operation. To alleviate these two problems, in this paper we propose a novel twostage Cloth Interactive Transformer (CIT) for virtual tryon. In the first stage, we design a CIT matching block, aiming to perform a learnable thin-plate spline transformation that can capture more reasonable long-range relation. As a result, the warped in-shop clothing looks more natural. In the second stage, we propose a novel CIT reasoning block for establishing the global mutual interactive dependence. Based on this mutual dependence, the significant region within the input data can be highlighted, and consequently, the try-on results can become more realistic. Extensive experiments on a public fashion dataset demonstrate that our CIT can achieve the new state-of-the-art virtual try-on performance both qualitatively and quantitatively. The source code and trained models are available at https://github.com/Amazingren/CIT.

[1]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[2]  Yuning Jiang,et al.  Controllable Person Image Synthesis With Attribute-Decomposed GAN , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Paul L. Rosin,et al.  CP-VTON+: Clothing Shape and Texture Preserving Image-Based Virtual Try-On , 2020 .

[4]  Masashi Nishiyama,et al.  Virtual Fitting by Single-Shot Body Shape Estimation , 2014 .

[5]  Michael J. Black,et al.  DRAPE , 2012, ACM Trans. Graph..

[6]  Hideo Saito,et al.  Texture overlay for virtual clothing based on PCA of silhouettes , 2006, 2006 IEEE/ACM International Symposium on Mixed and Augmented Reality.

[7]  Cl'ement Calauzenes,et al.  Do Not Mask What You Do Not Need to Mask: a Parser-Free Virtual Try-On , 2020, ECCV.

[8]  Ruslan Salakhutdinov,et al.  Multimodal Transformer for Unaligned Multimodal Language Sequences , 2019, ACL.

[9]  Nicu Sebe,et al.  XingGAN for Person Image Generation , 2020, ECCV.

[10]  Larry S. Davis,et al.  VITON: An Image-Based Virtual Try-on Network , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11]  David G. Lowe,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.

[12]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[13]  Ben Glocker,et al.  Attention Gated Networks: Learning to Leverage Salient Regions in Medical Images , 2018, Medical Image Anal..

[14]  Cordelia Schmid,et al.  An Affine Invariant Interest Point Detector , 2002, ECCV.

[15]  Nikolay Jetchev,et al.  The Conditional Analogy GAN: Swapping Fashion Articles on People Images , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[16]  Li Fei-Fei,et al.  Perceptual Losses for Real-Time Style Transfer and Super-Resolution , 2016, ECCV.

[17]  P. Jaccard THE DISTRIBUTION OF THE FLORA IN THE ALPINE ZONE.1 , 1912 .

[18]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19]  Jitendra Malik,et al.  Shape matching and object recognition using shape contexts , 2010, 2010 3rd International Conference on Computer Science and Information Technology.

[20]  Myle Ott,et al.  fairseq: A Fast, Extensible Toolkit for Sequence Modeling , 2019, NAACL.

[21]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[22]  Liang Lin,et al.  Toward Characteristic-Preserving Image-based Virtual Try-On Network , 2018, ECCV.

[23]  Georg Heigold,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.

[24]  Fred L. Bookstein,et al.  Principal Warps: Thin-Plate Splines and the Decomposition of Deformations , 1989, IEEE Trans. Pattern Anal. Mach. Intell..

[25]  Sanja Fidler,et al.  Be Your Own Prada: Fashion Synthesis with Structural Coherence , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Alexei A. Efros,et al.  The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Jung-Woo Ha,et al.  StarGAN: Unified Generative Adversarial Networks for Multi-domain Image-to-Image Translation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[29]  Nicu Sebe,et al.  Multi-Channel Attention Selection GAN With Cascaded Semantic Guidance for Cross-View Image Translation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Nicu Sebe,et al.  Bipartite Graph Reasoning GANs for Person Image Generation , 2020, BMVC.

[31]  Cordelia Schmid,et al.  Local Grayvalue Invariants for Image Retrieval , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[32]  Ruimao Zhang,et al.  Towards Photo-Realistic Virtual Try-On by Adaptively Generating↔Preserving Image Content , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Nicu Sebe,et al.  Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Nicu Sebe,et al.  Dual Attention GANs for Semantic Image Synthesis , 2020, ACM Multimedia.

[35]  Alla Sheffer,et al.  Design preserving garment transfer , 2012, ACM Trans. Graph..

[36]  Josef Sivic,et al.  Convolutional Neural Network Architecture for Geometric Matching , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Zhenhua Wang,et al.  Synthesizing Training Images for Boosting Human 3D Pose Estimation , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[38]  Daniel Cremers,et al.  DeepWrinkles: Accurate and Realistic Clothing Modeling , 2018, ECCV.

[39]  Pascal Fua,et al.  GarNet: A Two-Stream Network for Fast and Accurate 3D Cloth Draping , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[40]  Xiaohui Xie,et al.  VTNFP: An Image-Based Virtual Try-On Network With Body and Clothing Feature Preservation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).