OCR Post Processing Using Support Vector Machines

In this paper, we present a set of detailed experiments using Support Vector Machines (SVMs) to improve the accuracy of selecting the correct candidate word when correcting OCR-generated errors. We use our alignment algorithm to create a one-to-one correspondence between the OCR text and the clean version of the TREC-5 data set (Confusion Track). We then extract five features from the candidates suggested by the Google Web 1T corpus and use them to train and test an SVM model that then generalizes to the rest of the unseen text. We improve on our initial results using a polynomial kernel, feature standardization with min-max normalization, and class balancing with SMOTE. Finally, we analyze the errors and suggest future improvements.
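The three refinements named in the abstract (min-max normalization, a polynomial kernel, and SMOTE-style class balancing) can be illustrated with a minimal standard-library sketch. The function names `min_max_scale`, `poly_kernel`, and `smote_like` are hypothetical, the parameter values are illustrative rather than the paper's settings, and `smote_like` is a simplified interpolation in the spirit of SMOTE, not the reference implementation.

```python
import random


def min_max_scale(rows):
    """Rescale each feature column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def poly_kernel(x, y, degree=3, gamma=1.0, coef0=0.0):
    """Polynomial kernel: (gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples (assumes >= 2 minority points).

    Each new point lies on the segment between a randomly chosen minority
    sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, other)])
    return synthetic


# Toy feature rows (made-up numbers, not data from the paper).
features = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
scaled = min_max_scale(features)
similarity = poly_kernel([1.0, 2.0], [3.0, 4.0], degree=2, coef0=1.0)
extra = smote_like([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]], n_new=4)
```

In practice these steps would more likely come from established libraries (for example LIBSVM for the classifier and imbalanced-learn for SMOTE) than from hand-written code; the sketch only shows what each step computes.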
