Finding canonical forms for historical German text

Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text poses difficulties for any technique or system that relies on a fixed lexicon accessed by orthographic form. This paper presents two methods for mapping unknown historical text types to one or more synchronically active canonical types: conflation by phonetic form, and conflation by lemma instantiation heuristics. Implementation details and an evaluation of both methods are provided for a corpus of historical German verse quotation evidence from the digital edition of the Deutsches Wörterbuch.
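As a rough illustration of the first idea, conflation by phonetic form, the sketch below maps each word to a crude phonetic key and treats a historical spelling as a variant of any lexicon entry with the same key. The rewrite rules in `TOY_RULES`, the function names, and the three-word lexicon are hypothetical and chosen only for the example; they are not the phonetization rules or data used in the paper.

```python
from collections import defaultdict

# Hypothetical grapheme rewrite rules, applied in order.
# These are illustrative assumptions, NOT the rules used in the paper.
TOY_RULES = [("th", "t"), ("ey", "ei"), ("y", "i"), ("ß", "ss")]


def phonetic_key(word: str) -> str:
    """Map a word to a crude phonetic key by applying the toy rules."""
    key = word.lower()
    for old, new in TOY_RULES:
        key = key.replace(old, new)
    return key


def build_index(lexicon):
    """Group the canonical (contemporary) forms by their phonetic key."""
    index = defaultdict(set)
    for form in lexicon:
        index[phonetic_key(form)].add(form)
    return index


def conflate(word, index):
    """Return all canonical forms sharing the word's phonetic key."""
    return sorted(index.get(phonetic_key(word), ()))


# Hypothetical miniature lexicon of contemporary forms.
index = build_index(["Teil", "sein", "Tier"])
print(conflate("Theil", index))  # historical spelling of "Teil"
print(conflate("seyn", index))   # historical spelling of "sein"
```

A word whose key matches no lexicon entry yields an empty list, and a key shared by several entries yields all of them, which corresponds to mapping a historical type to "one or more" canonical types.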
