Area Attention

Existing attention mechanisms are trained to attend to individual items in a collection (the memory) with a predefined, fixed granularity, e.g., a word token or an image grid cell. We propose area attention: a way to attend to areas in the memory, where each area contains a group of items that are structurally adjacent, e.g., spatially for a 2D memory such as images, or temporally for a 1D memory such as natural language sentences. Importantly, the shape and size of an area are dynamically determined via learning, which enables a model to attend to information with varying granularity. Area attention works readily with existing model architectures such as multi-head attention, allowing a model to attend to multiple areas of the memory simultaneously. We evaluate area attention on two tasks: neural machine translation (both character- and token-level) and image captioning, and improve upon strong (state-of-the-art) baselines in all cases. These improvements are obtainable with a basic form of area attention that is parameter-free.
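The parameter-free basic form described above can be sketched for a 1D memory as follows. This is an illustrative NumPy sketch, not the authors' implementation: it assumes each area (a contiguous span up to a maximum width) is represented by the mean of its item keys and the sum of its item values, after which standard scaled dot-product attention is applied over the enumerated areas. The function name `area_attention_1d` and the `max_area_width` parameter are hypothetical names chosen for this example.

```python
import numpy as np

def area_attention_1d(query, keys, values, max_area_width=3):
    """Sketch of basic (parameter-free) 1D area attention.

    Enumerates every contiguous span (area) of up to max_area_width
    items, represents each area by the mean of its item keys and the
    sum of its item values, then applies scaled dot-product softmax
    attention over the areas.
    """
    n, d = keys.shape
    area_keys, area_values = [], []
    for width in range(1, max_area_width + 1):
        for start in range(n - width + 1):
            # Area key: mean of the member item keys.
            area_keys.append(keys[start:start + width].mean(axis=0))
            # Area value: sum of the member item values.
            area_values.append(values[start:start + width].sum(axis=0))
    area_keys = np.stack(area_keys)      # (num_areas, d)
    area_values = np.stack(area_values)  # (num_areas, d_v)

    # Standard scaled dot-product attention over the areas.
    scores = area_keys @ query / np.sqrt(d)   # (num_areas,)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ area_values              # (d_v,)
```

Because the area representations are derived purely by pooling the underlying item keys and values, this form introduces no new learned parameters; a 2D variant would enumerate rectangular regions of the memory instead of spans.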
