CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation

While recent progress has been made in text-to-image generation, text-to-shape generation remains a challenging problem due to the unavailability of large-scale paired text and shape data. We present a simple yet effective method for zero-shot text-to-shape generation based on a two-stage training process, which depends only on an unlabelled shape dataset and a pre-trained image-text network such as CLIP. Our method not only demonstrates promising zero-shot generalization, but also avoids expensive inference-time optimization and can generate multiple shapes for a given text.

Figure 1: CLIP-Forge generates meaningful shapes without using any shape-text pairing labels. Example prompts: "a cuboid sofa", "a round sofa", "an airplane", "a space shuttle", "an SUV", "a pickup truck".
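The abstract's two-stage recipe can be made concrete with a short sketch. Below is a minimal, hypothetical PyTorch illustration of the idea: stage 1 trains a shape autoencoder on unlabelled shapes (omitted here), and stage 2 trains a conditional normalizing flow that maps Gaussian noise to shape latents, conditioned on CLIP image embeddings of rendered shapes; at inference the condition is swapped for the CLIP text embedding of the prompt, so no text-shape pairs are ever needed. The 128-dimensional latent, the RealNVP-style coupling blocks, the layer widths, and the function names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git


class ConditionalCoupling(nn.Module):
    """RealNVP-style affine coupling block conditioned on a CLIP embedding.
    Sizes are illustrative, not the paper's exact configuration."""

    def __init__(self, dim: int = 128, cond_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, z: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Transform one half of z as an invertible affine function of the
        # other half and the condition. Real flows alternate which half is
        # transformed between blocks; omitted here for brevity.
        z1, z2 = z[:, :self.half], z[:, self.half:]
        scale, shift = self.net(torch.cat([z1, cond], dim=1)).chunk(2, dim=1)
        z2 = z2 * torch.exp(torch.tanh(scale)) + shift
        return torch.cat([z1, z2], dim=1)


@torch.no_grad()
def text_to_shape_latent(prompt: str, flow: nn.ModuleList,
                         latent_dim: int = 128, device: str = "cpu") -> torch.Tensor:
    """Zero-shot path: CLIP text embedding -> conditional flow -> shape latent."""
    model, _ = clip.load("ViT-B/32", device=device)          # 512-d embedding space
    tokens = clip.tokenize([prompt]).to(device)
    cond = model.encode_text(tokens).float()
    cond = cond / cond.norm(dim=-1, keepdim=True)            # L2-normalize (an assumption here)
    z = torch.randn(1, latent_dim, device=device)            # a new z gives a new shape for the same text
    for block in flow:                                       # run the flow in the generative direction
        z = block(z, cond)
    return z  # decode with the stage-1 implicit shape decoder (not shown)


# Usage (untrained weights, so the latent is meaningless until stage 2 is run):
flow = nn.ModuleList([ConditionalCoupling() for _ in range(4)])
latent = text_to_shape_latent("a round sofa", flow)
```

Because generation starts from Gaussian noise, drawing several z vectors for one prompt yields multiple distinct shapes, which is how a method of this form produces diverse outputs without any per-prompt optimization.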
