Learning document image binarization from data

We present a fully trainable solution for binarization of degraded document images using extremely randomized trees. Unlike previous approaches, which often rely on simple features, our method encodes all heuristics about whether or not a pixel is foreground text into a high-dimensional feature vector and learns a more complicated decision function. We introduce two novel features, the Logarithm Intensity Percentile (LIP) and the Relative Darkness Index (RDI), and combine them with low-level features and with reformulated features from existing binarization methods. Experimental results show that, using only a small training sample (about 1.5% of all available training data), we achieve binarization performance comparable to manually tuned, state-of-the-art methods. Additionally, the trained document binarization classifier generalizes well to out-of-domain data.
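The general idea of per-pixel classification with extremely randomized trees can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it uses scikit-learn's `ExtraTreesClassifier`, a synthetic test page, and three placeholder low-level features (intensity, local mean, local standard deviation). LIP and RDI are not reproduced because they are not defined in this summary. All function names (`pixel_features`, `make_synthetic_page`, `train_and_binarize`) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier


def pixel_features(gray, radius=2):
    """Return one feature row per pixel: intensity, local mean, local std.

    Placeholder low-level features only; these are NOT the LIP or RDI features.
    """
    gray = gray.astype(np.float64)
    h, w = gray.shape
    size = 2 * radius + 1
    padded = np.pad(gray, radius, mode="reflect")
    # Stack every shifted copy of the image so the last axis spans the window.
    windows = np.stack(
        [padded[dy:dy + h, dx:dx + w] for dy in range(size) for dx in range(size)],
        axis=-1,
    )
    feats = np.stack([gray, windows.mean(axis=-1), windows.std(axis=-1)], axis=-1)
    return feats.reshape(-1, feats.shape[-1])


def make_synthetic_page(rng, shape=(64, 64)):
    """Hypothetical stand-in for a document: dark strokes on a noisy bright page."""
    truth = np.zeros(shape, dtype=np.uint8)
    truth[10:14, 5:60] = 1
    truth[30:34, 5:60] = 1
    truth[20:50, 30:33] = 1
    gray = np.where(truth == 1, 60.0, 200.0) + rng.normal(0, 15, shape)
    return np.clip(gray, 0, 255), truth


def train_and_binarize(sample_fraction=0.015, seed=0):
    """Train on a small random subset of labeled pixels, then label a new page."""
    rng = np.random.default_rng(seed)
    train_img, train_truth = make_synthetic_page(rng)
    test_img, test_truth = make_synthetic_page(rng)

    X = pixel_features(train_img)
    y = train_truth.ravel()
    # Keep only a small fraction of the labeled pixels, echoing the 1.5% figure.
    n = max(1, int(sample_fraction * len(y)))
    idx = rng.choice(len(y), size=n, replace=False)

    clf = ExtraTreesClassifier(n_estimators=50, random_state=seed)
    clf.fit(X[idx], y[idx])

    binary = clf.predict(pixel_features(test_img)).reshape(test_img.shape)
    accuracy = float((binary == test_truth).mean())
    return binary, accuracy


if __name__ == "__main__":
    binary, accuracy = train_and_binarize()
    print(f"pixel accuracy on the synthetic page: {accuracy:.3f}")
```

A real system in this style would replace the placeholder features with a much richer feature vector and would likely be evaluated with the measures used in the binarization contests rather than plain pixel accuracy.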
