Using Comparable Collections of Historical Texts for Building a Diachronic Dictionary for Spelling Normalization

In this paper, we argue that comparable collections of historical written resources can help overcome typical challenges posed by heritage texts, thereby improving spelling normalization, POS tagging and subsequent diachronic linguistic analyses. To this end, we present a comparable corpus of historical German recipes and show how such a comparable text collection, together with innovative MT-inspired strategies, allows us (i) to address the word form normalization problem and (ii) to automatically generate a diachronic dictionary of spelling variants. Such a dictionary can be used both for spelling normalization and for extracting new "translation" (word formation/change) rules for diachronic spelling variants. Moreover, our approach can be applied to virtually any diachronic collection of texts, regardless of the time span they represent. A first evaluation shows that our approach compares well with state-of-the-art approaches.
