Paper information - Non-interactive OCR Post-correction for Giga-Scale Digitization Projects


This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, both contemporary and historical. Text-Induced Corpus Clean-up, or TICCL (pronounced 'tickle'), focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants of any particular focus word that lie within a predefined Levenshtein distance (henceforth: LD). Simple text-induced filtering techniques help to retain as many of the true positives as possible and to discard as many of the false positives as possible. TICCL has been evaluated on a contemporary OCR-ed Dutch text corpus and on a corpus of historical newspaper articles, whose OCR quality is far lower and which is written in an older Dutch spelling. Representative samples of typographical variants from both corpora have allowed us not only to evaluate our system properly, but also to draw conclusions on how to adapt the adopted correction mechanism to OCR-error resolution. The performance scores obtained up to LD 2 mean that the bulk of the undesirable OCR-induced typographical variation present can be removed fully automatically.
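The core idea in the abstract, gathering the variants of high-frequency focus words that fall within a Levenshtein distance limit, can be illustrated with a minimal Python sketch. This is not the system's actual implementation: the brute-force pairwise comparison, the function names `levenshtein` and `gather_variants`, the parameters `min_focus_freq` and `max_ld`, and the frequency-based filter are all assumptions made for illustration, and the sample tokens are invented.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def gather_variants(tokens, min_focus_freq=3, max_ld=2):
    """Map each high-frequency focus word to its rarer near neighbours.

    Hypothetical helper: the frequency threshold and the "variant must be
    rarer than the focus word" filter are illustrative assumptions only.
    """
    freq = Counter(tokens)
    focus_words = [w for w, n in freq.items() if n >= min_focus_freq]
    variants = {}
    for focus in focus_words:
        variants[focus] = sorted(
            w for w in freq
            if w != focus
            and freq[w] < freq[focus]
            and levenshtein(focus, w) <= max_ld
        )
    return variants


# Invented sample: one frequent Dutch word plus OCR-like misreadings.
tokens = ["regering"] * 5 + ["regcring", "rcgering", "rogering", "koning"]
result = gather_variants(tokens)
print(result)
```

In this sketch every vocabulary word is compared against every focus word, which is quadratic in the vocabulary size and would not scale to giga-scale collections; it is meant only to make the notions of focus word, variant, and distance limit concrete.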

Martin Reynaert
