Non-interactive OCR Post-correction for Giga-Scale Digitization Projects

This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, both contemporary and historical. Text-Induced Corpus Clean-up, or ticcl (pronounced 'tickle'), focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants of any particular focus word that lie within a predefined Levenshtein distance (henceforth: ld). Simple text-induced filtering techniques help to retain as many of the true positives as possible and to discard as many of the false positives as possible. ticcl has been evaluated on a contemporary OCR-ed Dutch text corpus and on a corpus of historical newspaper articles, whose OCR quality is far lower and which is written in an older Dutch spelling. Representative samples of typographical variants from both corpora have allowed us not only to evaluate our system properly, but also to draw conclusions about how to adapt the adopted correction mechanism to OCR-error resolution. The performance scores obtained up to ld 2 mean that the bulk of the undesirable OCR-induced typographical variation present can be removed fully automatically.
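The core idea of the abstract (take high-frequency focus words from the corpus itself and gather the variants that lie within a given ld) can be illustrated with a minimal sketch. This is a naive brute-force illustration and not the system's actual implementation, which the abstract does not detail. The function names, the `min_focus_freq` threshold, and the rule that a variant must be rarer than its focus word are all assumptions made for this example; the last one is only a stand-in for the text-induced filtering mentioned above.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def gather_variants(tokens, min_focus_freq=3, max_ld=2):
    """Map each focus word to the corpus word forms within max_ld of it.

    Assumptions (hypothetical, for illustration only):
    - a focus word is any word form occurring at least min_focus_freq times;
    - a candidate variant must be strictly rarer than its focus word.
    """
    freq = Counter(tokens)
    focus_words = [w for w, count in freq.items() if count >= min_focus_freq]
    variants = {}
    for focus in focus_words:
        found = sorted(
            w for w in freq
            if w != focus
            and freq[w] < freq[focus]
            and levenshtein(focus, w) <= max_ld
        )
        if found:
            variants[focus] = found
    return variants


# Toy corpus: one frequent Dutch word, three invented OCR-like misreadings,
# and one unrelated word that should not be gathered.
tokens = ["regering"] * 5 + ["regeriug", "rcgering", "regenng", "minister"]
variants_ld2 = gather_variants(tokens, max_ld=2)
variants_ld1 = gather_variants(tokens, max_ld=1)
```

Comparing every focus word with every word form is quadratic in the vocabulary size, so this sketch only conveys the logic on toy data; a corpus of the scale discussed here would need a far more efficient way of retrieving candidates.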
