Layout Generation and Completion with Self-attention

We address the problem of layout generation for diverse domains such as images, documents, and mobile applications. A layout is a set of graphical elements, belonging to one or more categories, placed together in a meaningful way. Generating a new layout or extending an existing one requires understanding the relationships between these graphical elements. To this end, we propose a novel framework, LayoutTransformer, that leverages a self-attention-based approach to learn contextual relationships between layout elements and generate layouts in a given domain. The proposed model improves upon state-of-the-art approaches in layout generation in four ways. First, our model can generate a new layout either from an empty set or by adding more elements to a partial layout, starting from an initial set of elements. Second, because the approach is attention-based, we can visualize which previous elements the model attends to when predicting the next element, thereby providing an interpretable sequence of layout elements. Third, our model scales easily to support both a large number of element categories and a large number of elements per layout. Finally, the model also produces an embedding for each element category, which can be used to explore relationships between categories. We demonstrate with experiments that our model produces meaningful layouts in diverse settings such as object bounding boxes in natural scenes (COCO bounding boxes), documents (PubLayNet), and mobile applications (RICO dataset).
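To make the sequence view of a layout concrete, here is a minimal sketch of how a set of elements can be flattened into one discrete token stream that a decoder-only Transformer could then model autoregressively. The five-tokens-per-element layout (category, x, y, w, h) and the grid granularity (32 bins) are illustrative assumptions, not necessarily the paper's exact scheme.

```python
GRID = 32  # assumed number of discretization bins per coordinate axis

def quantize(v, grid=GRID):
    """Map a normalized coordinate in [0, 1] to a discrete bin index."""
    return min(int(v * grid), grid - 1)

def layout_to_sequence(elements):
    """Flatten a layout into one token sequence.

    elements: list of (category_id, x, y, w, h) with coordinates
    normalized to [0, 1]. Each element contributes five tokens.
    """
    seq = []
    for cat, x, y, w, h in elements:
        seq.extend([cat, quantize(x), quantize(y), quantize(w), quantize(h)])
    return seq

def sequence_to_layout(seq, grid=GRID):
    """Invert the flattening (up to quantization error)."""
    step = 1.0 / grid
    out = []
    for i in range(0, len(seq), 5):
        cat, qx, qy, qw, qh = seq[i:i + 5]
        out.append((cat, qx * step, qy * step, qw * step, qh * step))
    return out

# Two elements: (category_id, x, y, w, h)
layout = [(3, 0.10, 0.20, 0.50, 0.30), (7, 0.60, 0.05, 0.25, 0.40)]
tokens = layout_to_sequence(layout)
assert len(tokens) == 10  # 2 elements x 5 tokens each
```

A Transformer decoder trained on such sequences can generate from scratch (start with an empty sequence) or complete a partial layout (condition on the tokens of the given elements), which mirrors the two generation modes described above.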
