A Large-Scale Comparison of Historical Text Normalization Systems

There is no consensus on the state-of-the-art approach to historical text normalization. Many techniques have been proposed, including rule-based methods, distance metrics, character-based statistical machine translation, and neural encoder-decoder models, but studies have used different datasets and different evaluation methods, and have come to different conclusions. This paper presents the largest study of historical text normalization to date. We critically survey the existing literature and report experiments on eight languages, comparing systems spanning all categories of proposed normalization techniques, analysing the effect of training data quantity, and using different evaluation methods. The datasets and scripts are made publicly available.
