Improving historical spelling normalization with bi-directional LSTMs and multi-task learning

Natural-language processing of historical documents is complicated by the abundance of variant spellings and the lack of annotated data. A common approach is to normalize the spelling of historical words to their modern forms. We explore the suitability of a deep neural network architecture for this task, specifically a deep bi-LSTM network applied at the character level. Our model compares well to previously established normalization algorithms when evaluated on a diverse set of Early New High German texts. We show that multi-task learning with additional normalization data can further improve our model's performance.
