Applying Rule-Based Normalization to Different Types of Historical Texts - An Evaluation

This paper deals with the normalization of language data from Early New High German. We describe an unsupervised, rule-based approach that maps historical wordforms to modern wordforms. The rules are context-aware rewrite rules that apply to sequences of characters. They are derived from two aligned versions of the Luther bible and weighted according to their frequency. Applying the normalization rules to texts by Luther yields 91% exact matches, clearly outperforming the baseline (65%). Combining the approach with a word substitution list raises this to 93%. When the rules are applied to more diverse language data from roughly the same period, performance drops to 43% exact matches (baseline: 35%), or 46% with the combined method. The results show that rules derived from a very different type of text can support normalization to a certain extent.
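
To make the idea of frequency-weighted, context-aware character rewrite rules more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes that `difflib.SequenceMatcher` is an acceptable stand-in for the character alignment, that one character of left and right context is enough, and that insertions with an empty source can be ignored. The names `derive_rules`, `normalize`, and `lexicon` are hypothetical, and the word pairs are illustrative toy examples rather than data from the Luther bible.

```python
from collections import Counter
from difflib import SequenceMatcher


def derive_rules(pairs):
    """Count rewrite rules (left, source, right, target) from aligned word pairs.

    Assumption: one character of context on each side; '#' marks a word boundary.
    """
    rules = Counter()
    for hist, mod in pairs:
        padded = "#" + hist + "#"
        matcher = SequenceMatcher(None, hist, mod, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Simplification: skip unchanged spans and pure insertions.
            if tag in ("equal", "insert"):
                continue
            left = padded[i1]        # character before the rewritten span
            right = padded[i2 + 1]   # character after the rewritten span
            rules[(left, hist[i1:i2], right, mod[j1:j2])] += 1
    return rules


def normalize(word, rules, lexicon=None):
    """Normalize one word: substitution list first, then the most frequent matching rule."""
    if lexicon and word in lexicon:
        return lexicon[word]
    padded = "#" + word + "#"
    out, i = [], 0
    while i < len(word):
        best = None
        for (left, src, right, tgt), freq in rules.items():
            end = i + len(src)
            if (word.startswith(src, i)
                    and padded[i] == left
                    and padded[end + 1] == right
                    and (best is None or freq > best[0])):
                best = (freq, src, tgt)
        if best:
            out.append(best[2])
            i += len(best[1])
        else:
            out.append(word[i])  # no rule applies: copy the character unchanged
            i += 1
    return "".join(out)


# Illustrative toy pairs (historical spelling, modern spelling).
toy_pairs = [("vnd", "und"), ("vns", "uns"), ("jm", "ihm"), ("thun", "tun")]
toy_rules = derive_rules(toy_pairs)

unseen = normalize("vnter", toy_rules)            # rule '#', v, n -> u applies
combined = normalize("jr", toy_rules, {"jr": "ihr"})  # handled by the word list
```

In this sketch the word substitution list is consulted before any rule is tried, which mirrors the combined method only in spirit; how the two components are actually combined in the paper is not stated in the abstract.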
