Post-Processing OCR Text using Web-Scale Corpora

We introduce a (semi-)automatic OCR post-processing system that uses web-scale linguistic corpora to produce high-quality corrections. This paper gives a comprehensive overview of the system, with a focus on its computational procedures, applied linguistic analysis, and processing optimizations.
