Statistical learning for OCR error correction

Abstract: Modern OCR engines incorporate some form of error correction, typically based on dictionaries. However, residual errors remain, and they degrade the performance of natural language processing algorithms applied to OCR text. In this paper, we present a statistical learning model for post-processing OCR errors, either in a fully automatic manner or followed by minimal user interaction to further reduce the error rate. Our model employs web-scale corpora and integrates a rich set of linguistic features. Through an interdependent learning pipeline, our model produces and continuously refines both the detection of errors and the suggestion of candidate corrections. Evaluated on a historical biology book with complex error patterns, our model outperforms various baseline methods in automatic mode and shows an even greater advantage when minimal user interaction is involved. Quantitative analysis of each computational step further suggests that our proposed model is well suited to handling volatile and complex OCR error patterns, which are beyond the capabilities of the error correction built into OCR engines.
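
To make the two stages named in the abstract concrete (error detection, then suggestion of candidate corrections), the following is a minimal sketch of a much simpler dictionary-and-frequency baseline. It is not the statistical learning model the paper proposes. The tiny `UNIGRAM_COUNTS` table stands in for web-scale corpus counts, and every name and count in it is a hypothetical chosen for illustration.

```python
from collections import Counter

# Hypothetical toy counts standing in for web-scale unigram frequencies.
UNIGRAM_COUNTS = Counter(
    {"the": 1000, "of": 900, "species": 120, "bird": 80, "birds": 60}
)
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string within one edit of `word` (the word itself included)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def detect_errors(tokens):
    """Flag positions whose token is absent from the lexicon (non-word errors only)."""
    return [i for i, tok in enumerate(tokens) if tok.lower() not in UNIGRAM_COUNTS]


def suggest(token, k=3):
    """Rank in-lexicon candidates within one edit by corpus frequency."""
    candidates = [w for w in edits1(token.lower()) if w in UNIGRAM_COUNTS]
    return sorted(candidates, key=lambda w: (-UNIGRAM_COUNTS[w], w))[:k]


def correct(tokens):
    """Automatic mode: replace each flagged token with its top suggestion, if any."""
    corrected = list(tokens)
    for i in detect_errors(tokens):
        best = suggest(tokens[i], k=1)
        if best:
            corrected[i] = best[0]
    return corrected
```

Under these assumptions, `correct(["the", "specics", "of", "birds"])` replaces `specics` with `species`. A baseline like this only catches non-word errors within a single edit and ranks by frequency alone, which is the kind of limitation the abstract attributes to dictionary-based correction; in an interactive mode, the ranked list from `suggest` could be shown to a user instead of applying the top candidate automatically.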
