Information Access to Historical Documents from the Early New High German Period

With the new interest in historical documents insight grew that electronic access to these texts causes many specific problems. In the first part of the paper we survey the present role of digital historical documents. After collecting central facts and observations on historical language change we comment on the difficulties that result for retrieval and data mining on historical texts. In the second part of the paper we report on our own work in the area with a focus on special matching strategies that help to relate modern language keywords with old variants. The basis of our studies is a collection of documents from the Early New High German period. These texts come with a very rich spectrum on word variants and spelling variations.

[1]  Klaus U. Schulz,et al.  Fast Approximate Search in Large Dictionaries , 2004, CL.

[2]  Dawn Archer,et al.  The Identification of Spelling Variants in English and German Historical Texts: Manual or Automatic? , 2008, Lit. Linguistic Comput..

[3]  Norbert Fuhr,et al.  Rule-based Search in Text Databases with Nonstandard Orthography , 2006, Lit. Linguistic Comput..

[4]  Ioannis Pratikakis,et al.  A segmentation-free approach for keyword search in historical typewritten documents , 2005, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).

[5]  Justin Zobel,et al.  Finding approximate matches in large lexicons , 1995, Softw. Pract. Exp..

[6]  Werner Besch,et al.  Wortbildung des Frühneuhochdeutschen , 2000 .

[7]  Norbert Fuhr,et al.  Generating Search Term Variants for Text Collections with Historic Spellings , 2006, ECIR.

[8]  Karen Kukich,et al.  Techniques for automatically correcting words in text , 1992, CSUR.

[9]  Klaus U. Schulz,et al.  Orthographic Errors in Web Pages: Toward Cleaner Web Corpora , 2006, Computational Linguistics.

[10]  François Bry,et al.  Content-Aware DataGuides: Interleaving IR and DB Indexing Techniques for Efficient Retrieval of Textual XML Data , 2004, ECIR.

[11]  Klaus U. Schulz,et al.  Fast string correction with Levenshtein automata , 2002, International Journal on Document Analysis and Recognition.

[12]  Eric Brill,et al.  An Improved Error Model for Noisy Channel Spelling Correction , 2000, ACL.

[13]  Dawn Archer,et al.  VARD versus WORD: A comparison of the UCREL variant detector and modern spellcheckers on English historical corpora , 2005 .

[14]  R. Manmatha,et al.  A search engine for historical manuscript images , 2004, SIGIR '04.

[15]  Klaus U. Schulz,et al.  A visual and interactive tool for optimizing lexical postcorrection of OCR results , 2003, 2003 Conference on Computer Vision and Pattern Recognition Workshop.

[16]  Werner Besch,et al.  Morphologie des Frühneuhochdeutschen , 2000 .

[17]  Werner Besch,et al.  Lexikologie und Lexikographie des Frühneuhochdeutschen , 2000 .

[18]  Peter N. Yianilos,et al.  Learning String-Edit Distance , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[19]  Kazem Taghva,et al.  OCRSpell: an interactive spelling correction system for OCR errors in text , 2001, International Journal on Document Analysis and Recognition.

[20]  Oskar Reichmann,et al.  Syntax des Frühneuhochdeutschen , 2000 .