Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences

Autoregressive models have proven to be very powerful in NLP text generation tasks and lately have gained popularity for image generation as well. However, they have seen limited use for the synthesis of 3D shapes so far. This is mainly due to the lack of a straightforward way to linearize 3D data as well as to scaling problems with the length of the resulting sequences when describing complex shapes. In this work we address both of these problems. We use octrees as a compact hierarchical shape representation that can be sequentialized by traversal ordering. Moreover, we introduce an adaptive compression scheme, that significantly reduces sequence lengths and thus enables their effective generation with a transformer, while still allowing fully autoregressive sampling and parallel training. We demonstrate the performance of our model by comparing against the state-of-the-art in shape generation.

[1]  Koray Kavukcuoglu,et al.  Pixel Recurrent Neural Networks , 2016, ICML.

[2]  Thomas Brox,et al.  Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[3]  Donald Meagher,et al.  Geometric modeling using octree encoding , 1982, Comput. Graph. Image Process..

[4]  Hao Zhang,et al.  Learning Implicit Fields for Generative Shape Modeling , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Yang Zhang,et al.  Point Cloud GAN , 2018, DGS@ICLR.

[6]  Dustin Tran,et al.  Image Transformer , 2018, ICML.

[7]  S. M. Ali Eslami,et al.  PolyGen: An Autoregressive Generative Model of 3D Meshes , 2020, ICML.

[8]  Leonidas J. Guibas,et al.  Learning Representations and Generative Models for 3D Point Clouds , 2017, ICML.

[9]  Mark Chen,et al.  Language Models are Few-Shot Learners , 2020, NeurIPS.

[10]  Oriol Vinyals,et al.  Neural Discrete Representation Learning , 2017, NIPS.

[11]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[12]  Nikolaos Pappas,et al.  Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention , 2020, ICML.

[13]  Yi Tay,et al.  Efficient Transformers: A Survey , 2020, ArXiv.

[14]  Georg Heigold,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.

[15]  Jitendra Malik,et al.  Hierarchical Surface Prediction for 3D Object Reconstruction , 2017, 2017 International Conference on 3D Vision (3DV).

[16]  Jerome H. Friedman,et al.  A New Graph-Based Two-Sample Test for Multivariate and Object Data , 2013, 1307.6294.

[17]  Alex Graves,et al.  Conditional Image Generation with PixelCNN Decoders , 2016, NIPS.

[18]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[19]  Leif Kobbelt,et al.  A Convolutional Decoder for Point Clouds using Adaptive Instance Normalization , 2019, Comput. Graph. Forum.

[20]  Franccois Fleuret,et al.  Fast Transformers with Clustered Attention , 2020, NeurIPS.

[21]  Leif Kobbelt,et al.  3D Shape Generation with Grid-based Implicit Functions , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Yue Wang,et al.  PointGrow: Autoregressively Learned Point Cloud Generation with Self-Attention , 2018, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[23]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[24]  Yang Liu,et al.  Adaptive O-CNN: A Patch-based Deep Representation of 3D Shapes , 2018 .

[25]  Gernot Riegler,et al.  OctNet: Learning Deep 3D Representations at High Resolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Jiajun Wu,et al.  Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling , 2016, NIPS.

[27]  Oliver Grau,et al.  VConv-DAE: Deep Volumetric Shape Learning Without Object Labels , 2016, ECCV Workshops.

[28]  N. Codella,et al.  CvT: Introducing Convolutions to Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[29]  Patrick Esser,et al.  Taming Transformers for High-Resolution Image Synthesis , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Theodore Lim,et al.  Generative and Discriminative Voxel Modeling with Convolutional Neural Networks , 2016, ArXiv.

[31]  Ming Ouhyoung,et al.  On Visual Similarity Based 3D Model Retrieval , 2003, Comput. Graph. Forum.

[32]  Christopher K. I. Williams,et al.  The shape variational autoencoder: A deep generative model of part‐segmented 3D objects , 2017, Comput. Graph. Forum.

[33]  Ilya Sutskever,et al.  Language Models are Unsupervised Multitask Learners , 2019 .