Learning string distance with smoothing for OCR spelling correction

Large databases of scanned documents (medical records, legal texts, historical documents) require natural language processing for retrieval and structured information extraction. Errors introduced by the optical character recognition (OCR) system increase the ambiguity of the recognized text and degrade the performance of natural language processing. This paper proposes an OCR post-correction system with a parametrized string distance metric. The correction system learns specific error patterns from incorrect words and common sequences of correct words. A smoothing technique is proposed to assign non-zero probability to edit operations not present in the training corpus. Spelling correction accuracy is measured on a database of OCR-processed legal documents in English. The language model and the learned string metric with smoothing improve the Viterbi-based search for the best sequence of corrections and increase the performance of the spelling correction system.
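The idea of a learned string distance with smoothing can be illustrated with a small sketch. The abstract does not specify the smoothing scheme or the parametrization, so everything below is an assumption made for illustration: add-k smoothing over a single joint distribution of character edit operations, costs defined as negative log-probabilities, and the hypothetical names `smoothed_costs`, `weighted_distance`, and `EPS`. The operation counts are toy values, not data from the paper.

```python
import math

EPS = ""  # empty symbol: (a, EPS) is a deletion, (EPS, b) an insertion


def smoothed_costs(op_counts, alphabet, k=1.0):
    """Turn edit-operation counts into costs (negative log-probabilities).

    Add-k smoothing (an assumed scheme) gives every possible operation,
    including those never seen in training, a non-zero probability.
    """
    symbols = list(alphabet) + [EPS]
    ops = [(a, b) for a in symbols for b in symbols if (a, b) != (EPS, EPS)]
    total = sum(op_counts.get(op, 0) for op in ops) + k * len(ops)
    return {op: -math.log((op_counts.get(op, 0) + k) / total) for op in ops}


def weighted_distance(source, target, cost):
    """Dynamic-programming edit distance with a learned cost per operation."""
    n, m = len(source), len(target)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost[(source[i - 1], EPS)]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost[(EPS, target[j - 1])]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + cost[(source[i - 1], EPS)],      # deletion
                d[i][j - 1] + cost[(EPS, target[j - 1])],      # insertion
                d[i - 1][j - 1] + cost[(source[i - 1], target[j - 1])],  # match or substitution
            )
    return d[n][m]


# Toy counts: identity operations are frequent, and the OCR confusion
# '1' -> 'l' was observed ten times; 'x' -> 'l' was never observed.
counts = {(c, c): 50 for c in "1lawx"}
counts[("1", "l")] = 10
costs = smoothed_costs(counts, alphabet="1lawx", k=1.0)

print(weighted_distance("1aw", "law", costs))  # observed confusion: lower cost
print(weighted_distance("xaw", "law", costs))  # unseen confusion: higher but finite cost
```

In a full correction system, a distance of this kind would act as the error-model score for each candidate word, combined with language model scores in the Viterbi search for the best sequence of corrections that the abstract describes.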
