Paper Information - A Survey of Data Augmentation Approaches for NLP

A Survey of Data Augmentation Approaches for NLP

Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks that require large amounts of training data. Despite this recent upsurge, this area is still relatively underexplored, perhaps due to the challenges posed by the discrete nature of language data. In this paper, we present a comprehensive and unifying survey of data augmentation for NLP by summarizing the literature in a structured manner. We first introduce and motivate data augmentation for NLP, and then discuss major methodologically representative approaches. Next, we highlight techniques that are used for popular NLP applications and tasks. We conclude by outlining current challenges and directions for future research. Overall, our paper aims to clarify the landscape of existing literature in data augmentation for NLP and motivate additional work in this area. We also present a GitHub repository with a paper list that will be continuously updated at https://github.com/styfeng/DataAug4NLP.
