Quantifying the Evaluation of Heuristic Methods for Textual Data Augmentation

Data augmentation has proven effective at supplying additional training data for machine learning and producing more robust classifiers. For some problems, however, multiple augmentation heuristics are available, and the choice among them can significantly affect how well training succeeds. In this work, we propose a metric for evaluating augmentation heuristics: we quantify the extent to which an augmented example is “hard to distinguish” by measuring the difference between the distributions of augmented samples drawn from different classes. Experiments with multiple heuristics on two prediction tasks (positive/negative sentiment and verbosity/conciseness) support our claims by revealing a connection between the inter-class distribution difference and classification accuracy.
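
The core idea can be sketched concretely: embed each class's augmented samples and compare the resulting class-conditional distributions. The snippet below is a minimal illustration of that idea, assuming a TF-IDF plus truncated-SVD embedding and a symmetric KL divergence between diagonal-Gaussian fits; the paper's actual sentence encoder and divergence measure may differ, and the function and variable names here are illustrative, not the authors' own.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def class_divergence(aug_pos, aug_neg, dim=16):
    """Symmetric KL divergence between diagonal-Gaussian fits of the embedded
    augmented samples of two classes. Larger values suggest the heuristic keeps
    the classes easier to distinguish (illustrative stand-in for the paper's metric)."""
    texts = aug_pos + aug_neg
    tfidf = TfidfVectorizer().fit_transform(texts)        # simple bag-of-words embedding
    dim = min(dim, tfidf.shape[1] - 1)                    # SVD needs dim < vocabulary size
    X = TruncatedSVD(n_components=dim, random_state=0).fit_transform(tfidf)
    Xp, Xn = X[:len(aug_pos)], X[len(aug_pos):]

    mu_p, var_p = Xp.mean(axis=0), Xp.var(axis=0) + 1e-6  # small floor avoids division by zero
    mu_n, var_n = Xn.mean(axis=0), Xn.var(axis=0) + 1e-6

    def kl(mu0, var0, mu1, var1):                         # KL(N0 || N1) for diagonal Gaussians
        return 0.5 * np.sum(np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

    return kl(mu_p, var_p, mu_n, var_n) + kl(mu_n, var_n, mu_p, var_p)


# Hypothetical usage: score two heuristics on the same seed data; the heuristic
# whose augmented classes remain further apart would be expected to train better.
# score_synonym   = class_divergence(synonym_aug_pos, synonym_aug_neg)
# score_backtrans = class_divergence(backtrans_aug_pos, backtrans_aug_neg)
```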
