Scalable Adaptive Computation for Iterative Generation

Natural data is redundant, yet predominant architectures tile computation uniformly across their input and output spaces. We propose Recurrent Interface Networks (RINs), an attention-based architecture that decouples its core computation from the dimensionality of the data, enabling adaptive computation for more scalable generation of high-dimensional data. RINs focus the bulk of computation (i.e., global self-attention) on a set of latent tokens, using cross-attention to read and write (i.e., route) information between latent and data tokens. Stacking RIN blocks allows bottom-up (data-to-latent) and top-down (latent-to-data) feedback, leading to deeper and more expressive routing. While this routing introduces challenges, it is less problematic in recurrent computation settings where the task (and hence the routing problem) changes gradually, such as iterative generation with diffusion models. We show how to leverage this recurrence by conditioning the latent tokens at each forward pass of the reverse diffusion process on those from prior computation, i.e., latent self-conditioning. RINs yield state-of-the-art pixel diffusion models for image and video generation, scaling to 1024×1024 images without cascades or guidance, while being domain-agnostic and up to 10× more efficient than 2D and 3D U-Nets.
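The read–compute–write routing described above can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's implementation: it omits layer normalization, MLP sublayers, multi-head projections, and time conditioning, and all function and parameter names (`attention`, `rin_block`, `params`) are hypothetical. The key property it shows is that the expensive step, self-attention, runs only over the small latent set, while the data tokens are touched only by linear-cost cross-attention.

```python
import numpy as np

def attention(queries, keys_values, wq, wk, wv):
    """Single-head scaled dot-product attention (minimal sketch)."""
    q = queries @ wq
    k = keys_values @ wk
    v = keys_values @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def rin_block(latents, data, params):
    """One RIN block: read (data -> latents), compute (latents only),
    write (latents -> data), each with a residual connection."""
    # read: latents cross-attend to the (many) data tokens
    latents = latents + attention(latents, data, *params["read"])
    # compute: global self-attention over the (few) latent tokens
    latents = latents + attention(latents, latents, *params["compute"])
    # write: data tokens cross-attend to the updated latents
    data = data + attention(data, latents, *params["write"])
    return latents, data
```

Stacking several such blocks gives the bottom-up/top-down feedback in the abstract; latent self-conditioning would simply initialize `latents` from the previous denoising step's latents instead of a learned constant.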
