Dynamic Inference with Neural Interpreters

Modern neural network architectures can leverage large amounts of data to generalize well within the training distribution. However, they are less capable of systematic generalization to data drawn from unseen but related distributions, a feat that is hypothesized to require compositional reasoning and reuse of knowledge. In this work, we present Neural Interpreters, an architecture that factorizes inference in a self-attention network into a system of modules, which we call functions. Inputs to the model are routed through a sequence of functions in a way that is learned end-to-end. The proposed architecture can flexibly compose computation along width and depth, and lends itself well to capacity extension after training. To demonstrate the versatility of Neural Interpreters, we evaluate them in two distinct settings: image classification and visual abstract reasoning on Raven Progressive Matrices. In the former, we show that Neural Interpreters perform on par with the vision transformer while using fewer parameters, and transfer to a new task in a sample-efficient manner. In the latter, we find that Neural Interpreters are competitive with the state of the art in terms of systematic generalization.
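
To make the learned routing concrete, below is a minimal, hypothetical PyTorch sketch of the general idea: a set of function modules, each carrying a learned signature vector, with input tokens softly dispatched to the functions whose signatures they match over several routing steps. The class names (`Function`, `NeuralInterpreterSketch`), the cosine-similarity routing score, and all hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of routing tokens through learned "function" modules.
# All names and hyperparameters are illustrative, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Function(nn.Module):
    """One module ("function"): a learned signature vector for routing plus a small MLP body."""

    def __init__(self, dim: int, sig_dim: int):
        super().__init__()
        self.signature = nn.Parameter(torch.randn(sig_dim))  # routing key for this function
        self.body = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)


class NeuralInterpreterSketch(nn.Module):
    """Stack of routing steps: at each step, every token is softly dispatched to the
    functions whose signatures it matches, and their outputs are mixed accordingly."""

    def __init__(self, dim: int = 64, sig_dim: int = 16, n_functions: int = 4, depth: int = 2):
        super().__init__()
        self.depth = depth
        self.functions = nn.ModuleList(Function(dim, sig_dim) for _ in range(n_functions))
        self.type_inference = nn.Linear(dim, sig_dim)  # maps a token to a routing query

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (batch, n_tokens, dim)
        for _ in range(self.depth):
            queries = F.normalize(self.type_inference(tokens), dim=-1)
            sigs = F.normalize(torch.stack([f.signature for f in self.functions]), dim=-1)
            # Routing weights: cosine similarity between token queries and function signatures.
            weights = F.softmax(queries @ sigs.t(), dim=-1)  # (batch, n_tokens, n_functions)
            outputs = torch.stack([f(tokens) for f in self.functions], dim=-1)  # (..., dim, n_fn)
            update = (outputs * weights.unsqueeze(-2)).sum(-1)  # route-weighted mixture
            tokens = tokens + update  # residual update after each routing step
        return tokens


if __name__ == "__main__":
    model = NeuralInterpreterSketch()
    x = torch.randn(2, 10, 64)  # a batch of 10 tokens per example
    print(model(x).shape)       # torch.Size([2, 10, 64])
```

Because the routing weights come from a differentiable softmax, the assignment of inputs to functions is trained end-to-end with the rest of the network; in this sketch, extending capacity after training would amount to appending further function modules to the list.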
