Compositional Attention: Disentangling Search and Retrieval

Multi-head, key-value attention is the backbone of the widely successful Transformer model and its variants. This attention mechanism uses multiple parallel key-value attention blocks (called heads), each performing two fundamental computations: (1) search – selection of a relevant entity from a set via query-key interactions, and (2) retrieval – extraction of relevant features from the selected entity via a value matrix. Importantly, standard attention heads learn a rigid mapping between search and retrieval. In this work, we first highlight how this static nature of the pairing can potentially: (a) lead to learning of redundant parameters in certain tasks, and (b) hinder generalization. To alleviate this problem, we propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure. The proposed mechanism disentangles search and retrieval and composes them in a dynamic, flexible, and context-dependent manner through an additional soft competition stage between the query-key combination and value pairing. Through a series of numerical experiments, we show that it outperforms standard multi-head attention on a variety of tasks, including some out-of-distribution settings. Through our qualitative analysis, we demonstrate that Compositional Attention leads to dynamic specialization based on the type of retrieval needed. Our proposed mechanism generalizes multi-head attention, allows independent scaling of search and retrieval, and can easily be implemented in lieu of standard attention heads in any network architecture.
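To make the abstract's description concrete, below is a minimal PyTorch sketch of the idea as stated there: S independent search (query-key) groups, R independent retrieval (value) groups, and a soft competition that lets each search select, per token, which retrieval to compose with. The class and parameter names (CompositionalAttention, n_searches, n_retrievals, comp_q, comp_k) are illustrative assumptions based only on the abstract, not the authors' reference implementation, which may differ in normalization, temperatures, and other details.

```python
# Sketch of compositional attention (assumptions noted above; not the paper's exact code).
import math
import torch
import torch.nn as nn


class CompositionalAttention(nn.Module):
    """S searches x R retrievals, composed per token by a learned soft competition."""

    def __init__(self, dim, n_searches=4, n_retrievals=4, head_dim=32):
        super().__init__()
        self.S, self.R, self.h = n_searches, n_retrievals, head_dim
        # Search parameters: one query/key projection per search group.
        self.q = nn.Linear(dim, n_searches * head_dim, bias=False)
        self.k = nn.Linear(dim, n_searches * head_dim, bias=False)
        # Retrieval parameters: one value projection per retrieval group.
        self.v = nn.Linear(dim, n_retrievals * head_dim, bias=False)
        # Soft-competition parameters: a query per search and a key per
        # retrieved output decide which retrieval each search composes with.
        self.comp_q = nn.Linear(dim, n_searches * head_dim, bias=False)
        self.comp_k = nn.Linear(head_dim, head_dim, bias=False)
        self.out = nn.Linear(n_searches * head_dim, dim)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.S, self.h).transpose(1, 2)  # (B, S, T, h)
        k = self.k(x).view(B, T, self.S, self.h).transpose(1, 2)  # (B, S, T, h)
        v = self.v(x).view(B, T, self.R, self.h).transpose(1, 2)  # (B, R, T, h)

        # Search: one attention pattern per search group.
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.h), dim=-1)  # (B, S, T, T)

        # Retrieval: apply every retrieval's values to every search pattern.
        retrieved = torch.einsum('bstu,bruh->bsrth', attn, v)  # (B, S, R, T, h)

        # Soft competition: each (token, search) picks a mixture over retrievals.
        cq = self.comp_q(x).view(B, T, self.S, self.h).permute(0, 2, 1, 3)  # (B, S, T, h)
        ck = self.comp_k(retrieved)                                          # (B, S, R, T, h)
        score = torch.einsum('bsth,bsrth->bsrt', cq, ck) / math.sqrt(self.h)
        gate = torch.softmax(score, dim=2).unsqueeze(-1)                     # (B, S, R, T, 1)
        out = (gate * retrieved).sum(dim=2)                                  # (B, S, T, h)

        return self.out(out.transpose(1, 2).reshape(B, T, self.S * self.h))
```

In this sketch, forcing the gate to a fixed one-to-one search-retrieval pairing with n_retrievals = n_searches recovers standard multi-head attention, which is consistent with the abstract's claim that the mechanism generalizes it; keeping the gate soft and input-dependent is what allows search and retrieval capacity to be scaled independently.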
