CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation

While recent progress has been made in text-to-image generation, text-to-shape generation remains a challenging problem due to the unavailability of large-scale paired text and shape data. We present a simple yet effective method for zero-shot text-to-shape generation based on a two-stage training process, which depends only on an unlabelled shape dataset and a pre-trained image-text network such as CLIP. Our method not only demonstrates promising zero-shot generalization, but also avoids expensive inference-time optimization and can generate multiple shapes for a given text.

Figure 1: CLIP-Forge generates meaningful shapes without using any shape-text pairing labels. Example prompts: "a cuboid sofa", "a round sofa", "an airplane", "a space shuttle", "an SUV", "a pickup truck".
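The abstract's two-stage recipe can be made concrete with a short sketch. Below is a minimal, hypothetical PyTorch illustration of the idea: stage 1 trains a shape autoencoder on unlabelled shapes (omitted here), and stage 2 trains a conditional normalizing flow that maps Gaussian noise to shape latents, conditioned on CLIP image embeddings of rendered shapes; at inference the condition is swapped for the CLIP text embedding of the prompt, so no text-shape pairs are ever needed. The 128-dimensional latent, the RealNVP-style coupling blocks, the layer widths, and the function names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git


class ConditionalCoupling(nn.Module):
    """RealNVP-style affine coupling block conditioned on a CLIP embedding.
    Sizes are illustrative, not the paper's exact configuration."""

    def __init__(self, dim: int = 128, cond_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, z: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Transform one half of z as an invertible affine function of the
        # other half and the condition. Real flows alternate which half is
        # transformed between blocks; omitted here for brevity.
        z1, z2 = z[:, :self.half], z[:, self.half:]
        scale, shift = self.net(torch.cat([z1, cond], dim=1)).chunk(2, dim=1)
        z2 = z2 * torch.exp(torch.tanh(scale)) + shift
        return torch.cat([z1, z2], dim=1)


@torch.no_grad()
def text_to_shape_latent(prompt: str, flow: nn.ModuleList,
                         latent_dim: int = 128, device: str = "cpu") -> torch.Tensor:
    """Zero-shot path: CLIP text embedding -> conditional flow -> shape latent."""
    model, _ = clip.load("ViT-B/32", device=device)          # 512-d embedding space
    tokens = clip.tokenize([prompt]).to(device)
    cond = model.encode_text(tokens).float()
    cond = cond / cond.norm(dim=-1, keepdim=True)            # L2-normalize (an assumption here)
    z = torch.randn(1, latent_dim, device=device)            # a new z gives a new shape for the same text
    for block in flow:                                       # run the flow in the generative direction
        z = block(z, cond)
    return z  # decode with the stage-1 implicit shape decoder (not shown)


# Usage (untrained weights, so the latent is meaningless until stage 2 is run):
flow = nn.ModuleList([ConditionalCoupling() for _ in range(4)])
latent = text_to_shape_latent("a round sofa", flow)
```

Because generation starts from Gaussian noise, drawing several z vectors for one prompt yields multiple distinct shapes, which is how a method of this form produces diverse outputs without any per-prompt optimization.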
