EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

We present EDA: easy data augmentation techniques for boosting performance on text classification tasks. EDA consists of four simple but powerful operations: synonym replacement, random insertion, random swap, and random deletion. On five text classification tasks, we show that EDA improves performance for both convolutional and recurrent neural networks. EDA demonstrates particularly strong results for smaller datasets; on average, across five datasets, training with EDA while using only 50% of the available training set achieved the same accuracy as normal training with all available data. We also performed extensive ablation studies and suggest parameters for practical use.
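
To make the four operations concrete, here is a minimal Python sketch of how they might be implemented; it is not the authors' released code. Synonym lookup uses WordNet via NLTK, and the function names and default parameters (n, p) are illustrative assumptions.

```python
# Minimal sketch of the four EDA operations (assumed implementation,
# not the reference code). Requires: nltk.download('wordnet')
import random
from nltk.corpus import wordnet


def synonym_replacement(words, n=1):
    """Replace up to n words in the sentence with a random WordNet synonym."""
    new_words = words.copy()
    candidates = [w for w in set(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    replaced = 0
    for word in candidates:
        synonyms = {
            lemma.name().replace('_', ' ')
            for syn in wordnet.synsets(word)
            for lemma in syn.lemmas()
            if lemma.name().lower() != word.lower()
        }
        if synonyms:
            choice = random.choice(sorted(synonyms))
            new_words = [choice if w == word else w for w in new_words]
            replaced += 1
        if replaced >= n:
            break
    return new_words


def random_insertion(words, n=1):
    """Insert a synonym of a random word at a random position, n times."""
    new_words = words.copy()
    for _ in range(n):
        synonyms, tries = [], 0
        while not synonyms and tries < 10:
            word = random.choice(new_words)
            synonyms = [
                lemma.name().replace('_', ' ')
                for syn in wordnet.synsets(word)
                for lemma in syn.lemmas()
            ]
            tries += 1
        if synonyms:
            position = random.randrange(len(new_words) + 1)
            new_words.insert(position, random.choice(synonyms))
    return new_words


def random_swap(words, n=1):
    """Swap the positions of two randomly chosen words, n times."""
    new_words = words.copy()
    for _ in range(n):
        i = random.randrange(len(new_words))
        j = random.randrange(len(new_words))
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words


def random_deletion(words, p=0.1):
    """Delete each word independently with probability p."""
    if len(words) == 1:
        return words
    remaining = [w for w in words if random.random() > p]
    return remaining if remaining else [random.choice(words)]


if __name__ == "__main__":
    sentence = "a sad superior human comedy played out on the back roads of life".split()
    print(synonym_replacement(sentence, n=2))
    print(random_deletion(sentence, p=0.1))
```

In practice, each operation is applied to a training sentence to generate several augmented copies, and the number of changed words is typically scaled with sentence length (the paper's alpha parameter).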
