ACT: an Attentive Convolutional Transformer for Efficient Text Classification

Recently, the Transformer has demonstrated promising performance on many NLP tasks and shows a trend of replacing the Recurrent Neural Network (RNN). Meanwhile, the Convolutional Neural Network (CNN) has received less attention because of its weakness in capturing sequential and long-distance dependencies, despite its excellent local feature extraction capability. In this paper, we introduce the Attentive Convolutional Transformer (ACT), which combines the advantages of the Transformer and the CNN for efficient text classification. Specifically, we propose a novel attentive convolution mechanism that exploits the semantic meaning of convolutional filters to attentively transform text from the complex word space into a more informative convolutional filter space where important n-grams are captured. ACT captures both local and global dependencies effectively while preserving sequential information. Experiments on various text classification tasks and detailed analyses show that ACT is a lightweight, fast, and effective universal text classifier, outperforming CNNs, RNNs, and attentive models including the Transformer.

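The abstract does not spell out the attentive convolution formulation, so the following is only a minimal PyTorch sketch of one way a layer could map word-space representations into a convolutional filter space via attention over the filters. The class name, dimensions, and the way attention weights are combined with the n-gram features are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of an "attentive convolution" layer in the spirit of ACT.
# Assumptions: (i) 1D convolutions extract n-gram features, and (ii) each filter
# also has a learnable semantic vector that words attend to, so the output lives
# in filter space while keeping one representation per position (sequence order
# is preserved). All names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveConvolution(nn.Module):
    def __init__(self, d_model: int, num_filters: int, kernel_size: int = 3):
        super().__init__()
        # Convolutional filters capture local n-gram patterns.
        self.conv = nn.Conv1d(d_model, num_filters, kernel_size,
                              padding=kernel_size // 2)
        # Each filter doubles as a semantic key vector in word space.
        self.filter_keys = nn.Parameter(torch.randn(num_filters, d_model))
        self.scale = d_model ** 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) word-space representations.
        # n-gram features from the convolution: (batch, seq_len, num_filters).
        ngram = self.conv(x.transpose(1, 2)).transpose(1, 2)
        # Scaled dot-product attention between each word and each filter key.
        scores = x @ self.filter_keys.t() / self.scale   # (batch, seq_len, num_filters)
        attn = F.softmax(scores, dim=-1)
        # Filter-space representation: n-gram features weighted by how relevant
        # each filter is to the word at that position.
        return attn * ngram                               # (batch, seq_len, num_filters)

if __name__ == "__main__":
    layer = AttentiveConvolution(d_model=300, num_filters=128)
    out = layer(torch.randn(2, 50, 300))
    print(out.shape)  # torch.Size([2, 50, 128])
```

In this sketch the output keeps one vector per token, which is what allows local n-gram evidence and position-wise attention to coexist without discarding sequential information; the actual ACT layer may differ in how it scores or aggregates the filters.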