Preliminary evaluation of histogram-based binarization algorithms

To date, most optical character recognition (OCR) systems process binary document images, and the quality of the input image strongly affects their performance. Because binarization is inherently lossy, different algorithms typically produce different binary images from the same grayscale image. The objective of this research is to study the effects of global binarization algorithms on the performance of OCR systems. Several binarization methods were examined: the best fixed threshold value for the data set, the ideal histogram method, and Otsu's algorithm. Four contemporary OCR systems and 50 hard copy pages containing 91,649 characters were used in the experiments. The pages were digitized at 300 dpi and 8 bits/pixel, and each page was binarized at 36 different threshold values (ranging from 59 to 199 in increments of 4). The resulting 1,800 binary images were processed by all four OCR systems. All systems made approximately 40% more errors on images generated by Otsu's method than on those generated by the ideal histogram method. Two of the systems made approximately the same number of errors on images generated by the best fixed threshold value as on those generated by Otsu's method.
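
The threshold sweep and Otsu's algorithm mentioned above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the function names (`fixed_thresholds`, `binarize`, `otsu_threshold`) are hypothetical, the image is assumed to be a list of rows of 8-bit gray levels, and pixels at or below the threshold are assumed to be ink. The ideal histogram method is not shown because the abstract does not define it.

```python
def fixed_thresholds(start=59, stop=199, step=4):
    """Threshold values swept in the experiments: 59, 63, ..., 199."""
    return list(range(start, stop + 1, step))


def binarize(pixels, threshold):
    """Map gray levels to 0 (ink) or 1 (background).

    Assumption: pixels at or below the threshold count as ink.
    """
    return [[0 if p <= threshold else 1 for p in row] for row in pixels]


def otsu_threshold(pixels, levels=256):
    """Return the gray level that maximizes the between-class variance."""
    # Build the gray-level histogram.
    hist = [0] * levels
    for row in pixels:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class (levels <= t)
    sum0 = 0  # sum of gray levels in the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break     # bright class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Toy two-level image: dark "ink" at 50, bright "paper" at 200.
page = [[50, 50, 200], [200, 50, 200]]
thresholds = fixed_thresholds()
binary_fixed = [binarize(page, t) for t in thresholds]
binary_otsu = binarize(page, otsu_threshold(page))
```

With 50 pages, sweeping `fixed_thresholds()` gives 50 × 36 = 1,800 binary images, which matches the count reported above. A fixed threshold is the same for every page, whereas Otsu's method picks one global threshold per page from that page's histogram.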
