Lexicon-supported OCR of eighteenth-century Dutch books: a case study

We report on a case study of OCR of eighteenth-century books conducted in the IMPACT project. After introducing the IMPACT project and its approach to lexicon building and deployment, we zoom in on the application of IMPACT tools and data to the Dutch EDBO collection. We illustrate the results with a detailed discussion of practical options for improving text recognition beyond the baseline of an uncustomized FineReader 10. In particular, we discuss improved recognition of the long s (ſ).
