Time-aware Large Kernel Convolutions

To date, most state-of-the-art sequence modeling architectures use attention to build generative models for language-based tasks. Some of these models use all the available sequence tokens to generate an attention distribution, which results in time complexity of $O(n^2)$. Alternatively, others utilize depthwise convolutions with softmax-normalized kernels of size $k$ acting as a limited-window self-attention, resulting in time complexity of $O(k{\cdot}n)$. In this paper, we introduce Time-aware Large Kernel (TaLK) Convolutions, a novel adaptive convolution operation that learns to predict the size of a summation kernel instead of using a fixed-sized kernel matrix. This method yields a time complexity of $O(n)$, effectively making the sequence encoding process linear in the number of tokens. We evaluate the proposed method on large-scale standard machine translation, abstractive summarization, and language modeling datasets and show that TaLK Convolutions constitute an efficient improvement over other attention- and convolution-based approaches.

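The $O(n)$ claim rests on combining per-position predicted window sizes with a one-dimensional summed-area table (prefix sum): once the inputs are prefix-summed, any adaptive-window sum reduces to a single subtraction of two prefix entries, so the whole sequence is encoded in time linear in its length. The following is a minimal PyTorch sketch of that idea, not the paper's exact formulation: the module name `TaLKConvSketch`, the `offset_proj` layer, the maximum window sizes, and the hard rounding of the predicted offsets are illustrative assumptions (the paper keeps the offsets differentiable via interpolation and adds output normalization).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaLKConvSketch(nn.Module):
    """Sketch of a Time-aware Large Kernel (TaLK) convolution.

    For every position i, a small linear layer predicts relative left/right
    offsets in [0, 1]; scaled by maximum window sizes, they define an adaptive
    summation window around i.  The windowed sum is read off a prefix sum
    (1-D summed-area table) of the inputs, so encoding a sequence costs O(n).
    The hard rounding of the offsets is a simplification of the paper's
    differentiable interpolation.
    """

    def __init__(self, dim: int, max_left: int = 3, max_right: int = 3):
        super().__init__()
        self.max_left = max_left
        self.max_right = max_right
        self.offset_proj = nn.Linear(dim, 2)  # predicts (alpha_left, alpha_right)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        B, N, D = x.shape

        # Relative offsets in [0, 1], scaled to token counts (rounded here).
        alphas = torch.sigmoid(self.offset_proj(x))              # (B, N, 2)
        left = (alphas[..., 0] * self.max_left).round().long()   # tokens to the left
        right = (alphas[..., 1] * self.max_right).round().long() # tokens to the right

        # Prefix sums along the time axis; prefix[:, j] = sum of x[:, :j].
        prefix = torch.cumsum(x, dim=1)
        prefix = F.pad(prefix, (0, 0, 1, 0))                     # (B, N + 1, D)

        pos = torch.arange(N, device=x.device).unsqueeze(0)      # (1, N)
        lo = (pos - left).clamp(min=0)                           # window start (inclusive)
        hi = (pos + right).clamp(max=N - 1) + 1                  # window end (exclusive)

        # Each window sum is one subtraction of two prefix entries.
        def gather(idx: torch.Tensor) -> torch.Tensor:
            return prefix.gather(1, idx.unsqueeze(-1).expand(B, N, D))

        window_sum = gather(hi) - gather(lo)
        window_len = (hi - lo).unsqueeze(-1).float()
        return window_sum / window_len                           # mean over adaptive window
```

Under these assumptions, a forward pass performs one cumulative sum, one offset prediction, and two gathers per sequence, regardless of how large the predicted windows are, which is what makes the cost independent of the kernel size $k$.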