Unsupervised profiling of OCRed historical documents

More and more OCRed historical documents are becoming available in search engines and digital libraries. Still, access to these texts is often unsatisfactory due to two problems: first, the quality of optical character recognition (OCR) on historical texts is often surprisingly low; second, historical spelling variation represents a barrier to search even when texts are correctly recognized. As one step towards a solution, we introduce a method that automatically computes a two-channel profile from an OCRed historical text. The profile includes (1) "global" information on typical recognition errors found in the OCR output, typical patterns of historical spelling variation, and the vocabulary and word frequencies of the underlying text, and (2) "local" hypotheses on OCR errors and historical orthography for particular tokens of the OCR output. We argue that the availability of this kind of knowledge represents a key step towards improving OCR and Information Retrieval (IR) on historical texts: profiles can be used, e.g., to automatically fine-tune postcorrection systems or adapt OCR engines to a given input document, and to define refined models for approximate search that are aware of the kind of language variation found in a specific document. Our evaluation shows a strong correlation between the true distribution of spelling variation patterns and recognition errors in the OCRed text and the ranks and scores automatically estimated in profiles. As a specific application, we show how to improve the output of a commercial OCR engine by using profiles in a postcorrection system.
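To make the two-channel idea concrete, the following is a minimal Python sketch, not the authors' implementation: it assumes a small modern lexicon, a hand-picked set of historical variation patterns, and a set of typical OCR confusions (all hypothetical here), and tries to explain each OCR token as lexicon word + historical spelling patterns + OCR errors, aggregating the patterns that were used into a "global" profile and keeping the per-token explanations as "local" hypotheses.

```python
# Illustrative two-channel profiling sketch (toy data, hypothetical names).
from collections import Counter

LEXICON = {"theil", "teil", "und", "haus"}          # toy lexicon
HIST_PATTERNS = [("t", "th"), ("ei", "ey")]          # modern -> historical spelling
OCR_PATTERNS = [("e", "c"), ("u", "n"), ("h", "b")]  # correct -> misrecognized char

def apply_patterns(word, patterns):
    """Yield (variant, used_patterns) for every combination of the given
    rewrite patterns, each applied at most once to its leftmost occurrence."""
    results = [(word, ())]
    for src, tgt in patterns:
        new_results = []
        for w, used in results:
            new_results.append((w, used))  # pattern not applied
            if src in w:
                new_results.append((w.replace(src, tgt, 1), used + ((src, tgt),)))
        results = new_results
    return results

def profile(ocr_tokens):
    """Return (global_profile, local_hypotheses): frequency counters for the
    historical and OCR pattern channels, plus per-token interpretations."""
    hist_counts, ocr_counts = Counter(), Counter()
    local = {}
    for token in ocr_tokens:
        hypotheses = []
        for word in LEXICON:
            for hist_form, hist_used in apply_patterns(word, HIST_PATTERNS):
                for ocr_form, ocr_used in apply_patterns(hist_form, OCR_PATTERNS):
                    if ocr_form == token:
                        hypotheses.append((word, hist_used, ocr_used))
        local[token] = hypotheses
        for _, hist_used, ocr_used in hypotheses:
            hist_counts.update(hist_used)   # channel 1: historical variation
            ocr_counts.update(ocr_used)     # channel 2: recognition errors
    return {"historical": hist_counts, "ocr": ocr_counts}, local

global_profile, local_hypotheses = profile(["thcil", "nnd", "haus"])
print(global_profile)       # e.g. ('t','th') and ('e','c') counted for "thcil"
print(local_hypotheses)     # candidate explanations per OCR token
```

A real profiler would of course learn the pattern sets and their weights from the document itself (that is the point of the method) and score hypotheses probabilistically rather than enumerating exact matches; the sketch only shows how global pattern statistics and local token hypotheses fall out of the same matching step.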
