On lexical resources for digitization of historical documents

Many European libraries are currently engaged in mass digitization projects that aim to make historical documents and corpora available online. In this context, appropriate lexical resources play a double role. First, they are needed to improve OCR of historical documents, which currently does not yield satisfactory results. Second, even assuming perfect OCR, historical language differs considerably from modern language, so matching the queries submitted to search engines against the variants of the search terms found in historical documents needs special support. While the usefulness of special dictionaries for both problems seems undisputed, concrete knowledge and experience are still missing. It is not known what optimal lexical resources for historical documents should look like, and the actual benefit gained from optimized lexical resources is unclear. Both questions are rather complex, since the answers depend on the period in which the documents were written. We present a series of experiments that shed light on these points. For our evaluations we collected a large corpus of German historical documents dating from before 1500 to 1950 and constructed various types of dictionaries. We report the coverage reached with each dictionary for ten subperiods. Additional experiments show the improvements in OCR accuracy and information retrieval (IR) that can be reached, again for distinct dictionaries and periods. For both OCR and IR, our lexical resources lead to substantial improvements.
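To make the notion of per-period dictionary coverage concrete, the following is a minimal sketch, not the procedure used in the experiments. It assumes coverage means the fraction of corpus tokens found in a dictionary, computed separately for each subperiod. The function name `coverage_by_period`, the period labels, and the toy word lists are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def coverage_by_period(tokens_with_period, lexicon):
    """Return {period: fraction of tokens found in the lexicon}.

    Assumption: coverage is counted per token (not per distinct word type)
    and lookup is case-insensitive.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for period, token in tokens_with_period:
        totals[period] += 1
        if token.lower() in lexicon:
            hits[period] += 1
    return {period: hits[period] / totals[period] for period in totals}


# Toy data: a tiny "modern" lexicon and a few tokens tagged with
# made-up period labels. Older spellings (vnd, theil, thun) are missed.
modern_lexicon = {"und", "teil", "jahr", "tun"}
corpus = [
    ("1500-1549", "vnd"), ("1500-1549", "theil"),
    ("1500-1549", "jahr"), ("1500-1549", "thun"),
    ("1900-1950", "und"), ("1900-1950", "teil"),
    ("1900-1950", "jahr"), ("1900-1950", "tun"),
]

result = coverage_by_period(corpus, modern_lexicon)
for period in sorted(result):
    print(period, result[period])
```

In this toy setting the modern lexicon covers only a quarter of the older tokens but all of the recent ones, which illustrates why coverage has to be reported per period rather than as a single figure.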
