Paper information - Improving historical spelling normalization with bi-directional LSTMs and multi-task learning

Natural-language processing of historical documents is complicated by the abundance of variant spellings and lack of annotated data. A common approach is to normalize the spelling of historical words to modern forms. We explore the suitability of a deep neural network architecture for this task, particularly a deep bi-LSTM network applied on a character level. Our model compares well to previously established normalization algorithms when evaluated on a diverse set of texts from Early New High German. We show that multi-task learning with additional normalization data can improve our model's performance further.

Anders Søgaard | Marcel Bollmann
