(Semi-)Automatic Normalization of Historical Texts using Distance Measures and the Norma tool

Historical texts typically show a high degree of variance in spelling. Normalizing variant word forms to their modern spellings can greatly benefit further processing of the data, e.g., POS tagging or lemmatization. This paper compares several approaches to normalization, with a focus on methods based on string distance measures, and evaluates them on two different types of historical texts. It also introduces Norma, an interactive normalization tool that can be flexibly adapted to different varieties of historical language data. A combination of normalization methods produces the best results, achieving an accuracy between 74% and 94% depending on the type of text.
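To make the idea of distance-based normalization concrete, the following is a minimal sketch and not the method implemented in Norma. It assumes an unweighted Levenshtein distance and a small modern word list. The function names `levenshtein` and `normalize`, the `max_distance` threshold, and the example lexicon are hypothetical and chosen for illustration only.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Unweighted edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(word: str, lexicon: Iterable[str], max_distance: int = 2) -> str:
    """Return the closest lexicon entry, or the word itself if none is close.

    Assumption: ties are broken alphabetically, and max_distance is an
    illustrative cutoff, not a value taken from the paper.
    """
    candidates = list(lexicon)
    if not candidates:
        return word
    best = min(candidates, key=lambda cand: (levenshtein(word, cand), cand))
    return best if levenshtein(word, best) <= max_distance else word


# Hypothetical modern word list and a historical spelling variant.
modern_lexicon = ["und", "uns", "hund"]
print(normalize("vnd", modern_lexicon))
```

In this sketch the historical form "vnd" is one substitution away from "und" and two edits away from the other entries, so "und" is selected. A weighted distance learned from training pairs would replace the uniform cost of 1 per edit.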
