Document Image Clean-up and Binarization

Image binarization is a difficult task for documents with text over textured or shaded backgrounds, poor contrast, and/or considerable noise. Current optical character recognition (OCR) and document analysis technology does not handle such documents well. We have developed a simple yet effective algorithm for document image clean-up and binarization. The algorithm consists of two basic steps. In the first step, the input image is smoothed using a low-pass (Gaussian) filter. The smoothing operation enhances the text relative to any background texture, because background texture normally has a higher spatial frequency than text does. The smoothing operation also removes speckle noise. In the second step, the intensity histogram of the smoothed image is computed and a threshold is selected automatically as follows. For black text, the first peak of the histogram corresponds to text. Thresholding the image at the value of the valley between the first and second peaks of the histogram binarizes the image well. To identify the valley reliably, the histogram is smoothed by a low-pass filter before the threshold is computed. The algorithm has been applied to some 50 images from a wide variety of sources: digitized video frames, photos, newspapers, advertisements in magazines or sales flyers, personal checks, etc. There are 21820 characters and 4406 words in these images. 91% of the characters and 86% of the words are successfully cleaned up and binarized. A commercial OCR system was applied to the binarized text when it consisted of fonts that were OCR recognizable. The recognition rate was 84% for the characters and 77% for the words.

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsors.
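The two steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes an 8-bit grayscale image with dark text on a lighter background, and the function names, the kernel widths (`sigma`, `hist_sigma`), and the peak-height floor (`min_peak_frac`) are hypothetical choices made only for this example; the abstract does not specify any of them.

```python
import numpy as np


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel, truncated at about 3 sigma."""
    radius = max(1, int(round(3 * sigma)))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def smooth_image(img, sigma=1.5):
    """Step 1: low-pass (Gaussian) filtering, applied separably."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    # Reflect-pad so the output keeps the input shape.
    padded = np.pad(np.asarray(img, dtype=float), r, mode="reflect")
    conv = lambda v: np.convolve(v, k, mode="valid")
    rows = np.apply_along_axis(conv, 1, padded)
    return np.apply_along_axis(conv, 0, rows)


def valley_threshold(img, hist_sigma=3.0, min_peak_frac=0.1):
    """Step 2: valley between the first two peaks of the smoothed histogram."""
    levels = np.clip(np.rint(img), 0, 255).astype(int).ravel()
    hist = np.bincount(levels, minlength=256).astype(float)
    # Smooth the histogram so the valley can be located reliably.
    hist = np.convolve(hist, gaussian_kernel(hist_sigma), mode="same")
    # Zero-pad so peaks at gray levels 0 and 255 are detected too.
    h = np.concatenate(([0.0], hist, [0.0]))
    # Assumption: ignore local maxima below a fraction of the tallest peak.
    floor = min_peak_frac * hist.max()
    peaks = [i - 1 for i in range(1, 257)
             if h[i] > h[i - 1] and h[i] >= h[i + 1] and h[i] >= floor]
    if len(peaks) < 2:
        raise ValueError("histogram does not have two distinct peaks")
    p1, p2 = peaks[0], peaks[1]
    return p1 + int(np.argmin(hist[p1:p2 + 1]))


def binarize(img, sigma=1.5):
    """Return (text_mask, threshold); True marks pixels classified as text."""
    smoothed = smooth_image(img, sigma)
    t = valley_threshold(smoothed)
    return smoothed < t, t
```

Calling `binarize(img)` on a 2-D array returns a boolean mask together with the selected gray-level threshold. For light text on a dark background, one option under the same assumptions would be to invert the image before calling it.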
