Efficiently generating correction suggestions for garbled tokens of historical language

Text correction systems rely on a core mechanism that generates suitable correction suggestions for garbled input tokens. Current systems, which are designed for documents written in modern language, use some form of approximate search in a given background lexicon. Because historical documents contain a large amount of spelling variation, special lexica for historical language can offer only restricted coverage. Historical language is therefore often described in terms of a matching procedure to be applied to modern words. Given such a procedure and a base lexicon of modern words, the question arises of how to generate correction suggestions for garbled historical variants. In this paper we suggest an efficient algorithm that solves this problem. The algorithm is used for postcorrection of optical character recognition results on historical document collections.
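To make the problem statement concrete, the following is a minimal brute-force sketch, not the efficient algorithm proposed in the paper. It assumes that the matching procedure can be written as a list of rewrite patterns mapping modern substrings to historical ones, and that garbling is measured by plain Levenshtein distance. The names `PATTERNS`, `LEXICON`, `historical_variants`, and `suggest`, as well as the example patterns and words, are hypothetical and chosen only for illustration.

```python
def levenshtein(a, b):
    """Standard edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def historical_variants(word, patterns):
    """All strings obtained by optionally applying each pattern at each
    position of a modern word (applications do not overlap)."""
    results = set()

    def expand(pos, built):
        if pos == len(word):
            results.add(built)
            return
        # Option 1: copy the current character unchanged.
        expand(pos + 1, built + word[pos])
        # Option 2: apply any pattern whose left side starts here.
        for left, right in patterns:
            if word.startswith(left, pos):
                expand(pos + len(left), built + right)

    expand(0, "")
    return results


def suggest(token, lexicon, patterns, max_dist=1):
    """Return sorted (distance, historical variant, modern word) triples
    for every variant within max_dist edits of the garbled token."""
    hits = set()
    for modern in lexicon:
        for variant in historical_variants(modern, patterns):
            d = levenshtein(token, variant)
            if d <= max_dist:
                hits.add((d, variant, modern))
    return sorted(hits)


# Assumed toy data: two patterns and a three-word modern lexicon.
PATTERNS = [("t", "th"), ("ei", "ey")]
LEXICON = ["teil", "turm", "zeit"]

if __name__ == "__main__":
    # "tbeyl" stands for an OCR-garbled reading of the variant "theyl".
    for dist, variant, modern in suggest("tbeyl", LEXICON, PATTERNS):
        print(dist, variant, modern)
```

This baseline enumerates every variant of every lexicon entry and compares each one with the input token, so its cost grows with the lexicon size times the number of variants per word. That cost is what an efficient algorithm for this problem has to avoid.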
