Automatic Selection of Binarization Method for Robust OCR

Many algorithms are now available for the same task in document image analysis (DIA), e.g. binarization, page segmentation, and character recognition, and choosing among them for a particular task is often a non-trivial problem. This paper proposes a model for automatically selecting the correct algorithm(s) for a given problem. Binarization is taken as a reference task to illustrate the proposed approach. Several previously unexplored issues are addressed in this work. For example, a single method may not binarize an entire document well, whereas a particular method may produce the desired result for a particular region. Therefore, for a given document image, our model selects a set of one or more binarization techniques suited to different regions of the document. This selection is completely automatic and guided by machine learning. A completely automatic procedure for generating the annotated data used to train the learning algorithms is also a novel contribution of this work. The approach is evaluated on the ICDAR 2003 Robust Reading data set, and the results highlight its potential for automatically selecting the correct DIA algorithm(s) from a set of several alternatives.
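The region-wise selection idea can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration, not the paper's actual design: the fixed square tiles standing in for document regions, the two candidate methods (Otsu and a Niblack-style mean-plus-k-times-std threshold computed once per region), the mean/std features, the nearest-centroid classifier, and the automatic labelling by pixel agreement with a known mask. All function names are hypothetical.

```python
import numpy as np

def otsu_threshold(gray):
    """Global Otsu threshold: maximise between-class variance of the histogram."""
    hist = np.bincount(gray.ravel().astype(np.int64), minlength=256).astype(float)
    cum_w = np.cumsum(hist)                       # pixel count with value <= t
    cum_m = np.cumsum(hist * np.arange(256))      # intensity mass with value <= t
    w0, w1 = cum_w, hist.sum() - cum_w
    valid = (w0 > 0) & (w1 > 0)                   # both classes must be non-empty
    m0 = cum_m / np.maximum(w0, 1)
    m1 = (cum_m[-1] - cum_m) / np.maximum(w1, 1)
    between = np.where(valid, w0 * w1 * (m0 - m1) ** 2, -1.0)
    return int(np.argmax(between))

def binarize_otsu(gray):
    return gray > otsu_threshold(gray)

def binarize_mean_std(gray, k=-0.2):
    """Niblack-style threshold (mean + k * std), computed once for the region."""
    return gray > gray.mean() + k * gray.std()

# Candidate methods; an assumed, illustrative set.
METHODS = {"otsu": binarize_otsu, "mean_std": binarize_mean_std}

def region_features(gray):
    # Assumed features: region mean and standard deviation.
    return np.array([gray.mean(), gray.std()], dtype=float)

def auto_label(gray, truth):
    """Label a training region with the method that best matches a known mask.
    Pixel agreement is an assumed criterion for this sketch."""
    return max(METHODS, key=lambda name: np.mean(METHODS[name](gray) == truth))

def fit_centroids(features, labels):
    """Nearest-centroid 'training': one mean feature vector per method label."""
    feats = np.asarray(features, dtype=float)
    names = np.asarray(labels)
    return {name: feats[names == name].mean(axis=0) for name in set(labels)}

def select_method(gray, centroids):
    f = region_features(gray)
    return min(centroids, key=lambda name: np.linalg.norm(f - centroids[name]))

def binarize_by_region(image, centroids, tile=32):
    """Split the image into tiles and binarize each with its selected method."""
    out = np.zeros(image.shape, dtype=bool)
    choices = {}
    for r in range(0, image.shape[0], tile):
        for c in range(0, image.shape[1], tile):
            block = image[r:r + tile, c:c + tile]
            name = select_method(block, centroids)
            out[r:r + tile, c:c + tile] = METHODS[name](block)
            choices[(r, c)] = name
    return out, choices
```

In this sketch, training regions are labelled with `auto_label`, their features and labels are passed to `fit_centroids`, and `binarize_by_region` then returns both the binarized page and a map from tile origin to the chosen method. A practical system would likely use a proper region segmentation and a stronger classifier in place of these stand-ins.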
