Compositional Attention: Disentangling Search and Retrieval

Multi-head, key-value attention is the backbone of the widely successful Transformer model and its variants. This attention mechanism uses multiple parallel key-value attention blocks (called heads), each performing two fundamental computations: (1) search – selection of a relevant entity from a set via query-key interactions, and (2) retrieval – extraction of relevant features from the selected entity via a value matrix. Importantly, standard attention heads learn a rigid mapping between search and retrieval. In this work, we first highlight how this static nature of the pairing can potentially: (a) lead to learning of redundant parameters in certain tasks, and (b) hinder generalization. To alleviate this problem, we propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure. The proposed mechanism disentangles search and retrieval and composes them in a dynamic, flexible, and context-dependent manner through an additional soft competition stage between the query-key combination and value pairing. Through a series of numerical experiments, we show that it outperforms standard multi-head attention on a variety of tasks, including some out-of-distribution settings. Through our qualitative analysis, we demonstrate that Compositional Attention leads to dynamic specialization based on the type of retrieval needed. Our proposed mechanism generalizes multi-head attention, allows independent scaling of search and retrieval, and can easily be implemented in lieu of standard attention heads in any network architecture.
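To make the abstract's description concrete, below is a minimal PyTorch sketch of the idea as stated there: S independent search (query-key) groups, R independent retrieval (value) groups, and a soft competition that lets each search select, per token, which retrieval to compose with. The class and parameter names (CompositionalAttention, n_searches, n_retrievals, comp_q, comp_k) are illustrative assumptions based only on the abstract, not the authors' reference implementation, which may differ in normalization, temperatures, and other details.

```python
# Sketch of compositional attention (assumptions noted above; not the paper's exact code).
import math
import torch
import torch.nn as nn


class CompositionalAttention(nn.Module):
    """S searches x R retrievals, composed per token by a learned soft competition."""

    def __init__(self, dim, n_searches=4, n_retrievals=4, head_dim=32):
        super().__init__()
        self.S, self.R, self.h = n_searches, n_retrievals, head_dim
        # Search parameters: one query/key projection per search group.
        self.q = nn.Linear(dim, n_searches * head_dim, bias=False)
        self.k = nn.Linear(dim, n_searches * head_dim, bias=False)
        # Retrieval parameters: one value projection per retrieval group.
        self.v = nn.Linear(dim, n_retrievals * head_dim, bias=False)
        # Soft-competition parameters: a query per search and a key per
        # retrieved output decide which retrieval each search composes with.
        self.comp_q = nn.Linear(dim, n_searches * head_dim, bias=False)
        self.comp_k = nn.Linear(head_dim, head_dim, bias=False)
        self.out = nn.Linear(n_searches * head_dim, dim)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.S, self.h).transpose(1, 2)  # (B, S, T, h)
        k = self.k(x).view(B, T, self.S, self.h).transpose(1, 2)  # (B, S, T, h)
        v = self.v(x).view(B, T, self.R, self.h).transpose(1, 2)  # (B, R, T, h)

        # Search: one attention pattern per search group.
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.h), dim=-1)  # (B, S, T, T)

        # Retrieval: apply every retrieval's values to every search pattern.
        retrieved = torch.einsum('bstu,bruh->bsrth', attn, v)  # (B, S, R, T, h)

        # Soft competition: each (token, search) picks a mixture over retrievals.
        cq = self.comp_q(x).view(B, T, self.S, self.h).permute(0, 2, 1, 3)  # (B, S, T, h)
        ck = self.comp_k(retrieved)                                          # (B, S, R, T, h)
        score = torch.einsum('bsth,bsrth->bsrt', cq, ck) / math.sqrt(self.h)
        gate = torch.softmax(score, dim=2).unsqueeze(-1)                     # (B, S, R, T, 1)
        out = (gate * retrieved).sum(dim=2)                                  # (B, S, T, h)

        return self.out(out.transpose(1, 2).reshape(B, T, self.S * self.h))
```

In this sketch, forcing the gate to a fixed one-to-one search-retrieval pairing with n_retrievals = n_searches recovers standard multi-head attention, which is consistent with the abstract's claim that the mechanism generalizes it; keeping the gate soft and input-dependent is what allows search and retrieval capacity to be scaled independently.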
