User-Defined Expected Error Rate in OCR Postprocessing by Means of Automatic Threshold Estimation

This work presents a method for automatically estimating a rejection threshold so that the user of an OCR system can specify an expected error rate. When the OCR output is post-processed using a language model, a reliability index is usually obtained (a probability or a “transformation cost”), reflecting the likelihood (or its inverse) that the string of OCR hypotheses belongs to the model. By applying a threshold to this index (or cost) to reject the less reliable hypotheses, a variable level of expected accuracy can be imposed on the output. For the user, it is much more convenient to fix the expected error rate at an acceptable level than to deal with an arbitrary threshold. Naturally, a given target error rate will lead to higher reject rates on difficult tasks and lower reject rates on easier ones.
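To make the idea of turning a target error rate into a threshold concrete, the following is a minimal sketch and not the estimation procedure proposed in the paper. It assumes a labeled validation set of (cost, is_correct) pairs is available, that lower cost means a more reliable hypothesis, and that a hypothesis is accepted when its cost is at or below the threshold. The function name `estimate_threshold` and its return format are hypothetical.

```python
from typing import List, Optional, Tuple


def estimate_threshold(
    samples: List[Tuple[float, bool]], target_error_rate: float
) -> Optional[Tuple[float, float, float]]:
    """Pick the largest cost threshold whose accepted set meets the target.

    samples: (cost, is_correct) pairs from a labeled validation set
             (an assumption of this sketch, not taken from the paper).
    Returns (threshold, error_rate_among_accepted, reject_rate),
    or None if no threshold satisfies the target.
    """
    ordered = sorted(samples, key=lambda s: s[0])
    total = len(ordered)
    best = None
    errors = 0
    for i, (cost, is_correct) in enumerate(ordered):
        if not is_correct:
            errors += 1
        # Hypotheses with equal cost are accepted or rejected together,
        # so evaluate only at the last element of a run of ties.
        if i + 1 < total and ordered[i + 1][0] == cost:
            continue
        accepted = i + 1
        error_rate = errors / accepted
        if error_rate <= target_error_rate:
            best = (cost, error_rate, 1.0 - accepted / total)
    return best


# Toy validation data: (transformation cost, hypothesis was correct).
validation = [
    (0.1, True), (0.2, True), (0.3, False),
    (0.4, True), (0.5, False), (0.9, False),
]
result = estimate_threshold(validation, target_error_rate=0.25)
print(result)
```

On this toy data, a target of 0.25 leads to accepting the four lowest-cost hypotheses and rejecting the other two. A stricter target moves the threshold down and raises the reject rate, which mirrors the trade-off described above.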
