NAT: Noise-Aware Training for Robust Neural Sequence Labeling

Sequence labeling systems should perform reliably not only under ideal conditions but also on corrupted inputs, as these systems often process user-generated text or follow an error-prone upstream component. To this end, we formulate the noisy sequence labeling problem, where the input may undergo an unknown noising process, and propose two Noise-Aware Training (NAT) objectives that improve the robustness of sequence labeling performed on perturbed input: our data augmentation method trains a neural model on a mixture of clean and noisy samples, whereas our stability training algorithm encourages the model to build a noise-invariant latent representation. We employ a vanilla noise model at training time. For evaluation, we use both the original data and variants perturbed with real OCR errors and misspellings. Extensive experiments on English and German named entity recognition benchmarks confirm that NAT consistently improves the robustness of popular sequence labeling models while preserving accuracy on the original input. We make our code and data publicly available for the research community.
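To make the two objectives concrete, below is a minimal PyTorch sketch. The helper names (`vanilla_noise`, `nat_augmentation_loss`, `nat_stability_loss`), the uniform character-level error model, the KL-divergence similarity term, and the weighting factor `alpha` are illustrative assumptions for this sketch, not necessarily the paper's exact formulation.

```python
import random
import string

import torch.nn.functional as F

CHARS = string.ascii_lowercase

def vanilla_noise(tokens, p=0.1):
    """Perturb each character with probability p, choosing uniformly
    among insertion, deletion, and substitution. A simplified
    stand-in for a vanilla character-level error model."""
    noisy = []
    for tok in tokens:
        out = []
        for ch in tok:
            if random.random() < p:
                op = random.choice(["ins", "del", "sub"])
                if op == "ins":
                    out.extend([ch, random.choice(CHARS)])
                elif op == "sub":
                    out.append(random.choice(CHARS))
                # "del": drop the character entirely
            else:
                out.append(ch)
        noisy.append("".join(out) or tok)  # avoid empty tokens
    return noisy

def nat_augmentation_loss(model, x, x_noisy, y, alpha=1.0):
    """Data augmentation objective: the standard loss on the clean
    sample plus a weighted loss on its noisy counterpart.
    `model` is assumed to return per-token logits flattened to
    shape (num_tokens, num_tags)."""
    loss_clean = F.cross_entropy(model(x), y)
    loss_noisy = F.cross_entropy(model(x_noisy), y)
    return loss_clean + alpha * loss_noisy

def nat_stability_loss(model, x, x_noisy, y, alpha=1.0):
    """Stability objective: the standard loss on the clean sample
    plus a similarity term that pulls the model's output
    distributions on clean and noisy inputs together (KL divergence
    is used here as one possible choice)."""
    logits_clean = model(x)
    logits_noisy = model(x_noisy)
    loss_clean = F.cross_entropy(logits_clean, y)
    stability = F.kl_div(
        F.log_softmax(logits_noisy, dim=-1),
        F.softmax(logits_clean, dim=-1),
        reduction="batchmean",
    )
    return loss_clean + alpha * stability
```

In both objectives, `alpha` trades off clean-input accuracy against robustness; setting `alpha = 0` recovers standard training on clean data only.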
