A Survey of Data Augmentation Approaches for NLP

Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks that require large amounts of training data. Despite this recent upsurge, this area is still relatively underexplored, perhaps due to the challenges posed by the discrete nature of language data. In this paper, we present a comprehensive and unifying survey of data augmentation for NLP by summarizing the literature in a structured manner. We first introduce and motivate data augmentation for NLP, and then discuss major methodologically representative approaches. Next, we highlight techniques that are used for popular NLP applications and tasks. We conclude by outlining current challenges and directions for future research. Overall, our paper aims to clarify the landscape of existing literature in data augmentation for NLP and motivate additional work in this area. We also present a GitHub repository with a paper list that will be continuously updated at https://github.com/styfeng/DataAug4NLP.
