A new hybrid binarization method based on Kmeans

Document binarization is a fundamental preprocessing step for Optical Character Recognition (OCR). It aims to separate the foreground text from the document background. In this article, we propose a novel binarization technique that combines local and global approaches using the Kmeans clustering algorithm. The proposed method, Hybrid Binarization based on Kmeans (HBK), performs robust binarization of scanned documents. Through several experiments, we show that HBK improves binarization quality while minimizing distortion. Moreover, it outperforms several well-known state-of-the-art methods in an OCR-based evaluation.
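
The abstract does not give the details of HBK, so the following is only a minimal sketch of the general idea of mixing a global and a local two-cluster Kmeans threshold; it is not the authors' algorithm. The function names `kmeans_threshold` and `hybrid_binarize`, the tile size, and the `min_contrast` rule for choosing between the local and the global threshold are all assumptions made for illustration.

```python
import numpy as np


def kmeans_threshold(values, iters=50):
    """Two-cluster 1-D k-means (Lloyd iterations) on gray levels.

    Returns the midpoint between the two final centroids, which is the
    decision boundary between the dark and the bright cluster.
    """
    values = np.asarray(values, dtype=float).ravel()
    lo, hi = values.min(), values.max()  # initial centroids: extreme gray levels
    if lo == hi:
        return lo  # degenerate case: only one gray level present
    for _ in range(iters):
        t = (lo + hi) / 2.0
        new_lo = values[values <= t].mean()  # centroid of the dark cluster
        new_hi = values[values > t].mean()   # centroid of the bright cluster
        if new_lo == lo and new_hi == hi:
            break  # assignments no longer change
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


def hybrid_binarize(gray, tile=32, min_contrast=30.0):
    """Hypothetical hybrid rule (an assumption, not taken from the paper).

    Tiles with enough contrast get their own local k-means threshold;
    low-contrast tiles fall back to the global k-means threshold.
    Returns a boolean mask where True marks dark (foreground) pixels.
    """
    gray = np.asarray(gray, dtype=float)
    t_global = kmeans_threshold(gray)
    mask = np.zeros(gray.shape, dtype=bool)
    for r in range(0, gray.shape[0], tile):
        for c in range(0, gray.shape[1], tile):
            block = gray[r:r + tile, c:c + tile]
            if block.max() - block.min() >= min_contrast:
                t = kmeans_threshold(block)  # local decision
            else:
                t = t_global                 # global decision
            mask[r:r + tile, c:c + tile] = block <= t
    return mask


# Usage: a synthetic page with a dark rectangle standing in for text.
page = np.full((64, 64), 200.0)
page[10:20, 10:40] = 50.0
text_mask = hybrid_binarize(page)
```

The fallback to the global threshold is one common way to keep a purely local method from amplifying noise in flat background regions; a real implementation would need to tune or replace this rule.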
