OCR Post-Processing Text Correction using Simulated Annealing (OPTeCA)

This paper describes the system and results of team “EOF” from the University of Melbourne for the ALTA 2017 shared task, which addresses text correction for the output of Optical Character Recognition (OCR) systems. We developed a two-stage system. The first stage detects errors in the OCR post-processed text using a support vector machine trained on the provided training dataset. The second stage rectifies those errors with a confidence-based mechanism that uses simulated annealing to select an optimal correction from a pool of candidate corrections. Our system achieved an F1-score of 32.98% on the private leaderboard, the best score among all participating systems.
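The correction stage can be illustrated with a minimal sketch, under stated assumptions. The abstract does not specify the confidence function, the candidate generator, or the cooling schedule, so everything below is hypothetical: `confidence` (string similarity to the OCR token weighted by a relative word frequency), the `frequency` table, and the parameters `steps`, `start_temp`, and `cooling` are illustrative choices, not the authors' implementation.

```python
import math
import random
from difflib import SequenceMatcher


def confidence(candidate, observed, frequency):
    """Hypothetical confidence: similarity to the OCR token times word frequency."""
    similarity = SequenceMatcher(None, candidate, observed).ratio()
    return similarity * frequency.get(candidate, 0.0)


def anneal_correction(observed, candidates, frequency, steps=500,
                      start_temp=1.0, cooling=0.99, seed=0):
    """Select a correction for `observed` from `candidates` by simulated annealing."""
    if not candidates:
        # Nothing to choose from: leave the token unchanged.
        return observed
    rng = random.Random(seed)
    current = rng.choice(candidates)
    current_score = confidence(current, observed, frequency)
    best, best_score = current, current_score
    temp = start_temp
    for _ in range(steps):
        proposal = rng.choice(candidates)
        proposal_score = confidence(proposal, observed, frequency)
        delta = proposal_score - current_score
        # Always accept improvements; accept a worse candidate with
        # probability exp(delta / temp), which shrinks as temp cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = proposal, proposal_score
        if current_score > best_score:
            best, best_score = current, current_score
        # Geometric cooling, floored to avoid division by zero.
        temp = max(temp * cooling, 1e-6)
    return best


# Illustrative data only: a misrecognised token and made-up frequencies.
toy_frequency = {"the": 0.9, "then": 0.3, "tho": 0.05}
print(anneal_correction("tbe", ["tho", "the", "then"], toy_frequency))
```

With a pool this small, an exhaustive search would do equally well; annealing becomes useful when the candidate space is too large to score in full, for example when corrections for several tokens are chosen jointly.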
