Learning Texture Features for Enhancement and Segmentation of Historical Document Images

The rapid growth in the digitization of cultural heritage collections has raised many challenges and open issues, such as information retrieval in digital libraries and analysis of the page content of historical books. Separating graphics from text in historical documents is particularly difficult because of the particularities of historical document images (e.g. noise and degradation, presence of handwriting, overlapping layouts, and great variability of page layout). To cope with these challenges, this article proposes a method based on learning texture features for the enhancement and segmentation of historical document images. The method combines simple linear iterative clustering (SLIC) superpixels, Gabor descriptors, and support vector machines (SVM). It has been evaluated on 100 document images selected from the datasets of the historical document layout analysis and historical book recognition competitions held in the context of the ICDAR conference and the HIP workshop (2011 and 2013). The evaluation relies on manually labeled ground truth and demonstrates the effectiveness of the proposed method for enhancement and segmentation through qualitative and numerical experiments. The method gives promising results on historical document images with various page layouts and different typographical and graphical properties.
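
The abstract names Gabor descriptors as the texture feature but gives no parameters, so the following is only a minimal sketch of what such a descriptor might look like for one image region (for instance, one superpixel). The function names, the kernel size, wavelength, sigma, the two orientations, and the "mean absolute response" statistic are all assumptions made for illustration, not the authors' settings. The SLIC and SVM stages are omitted; in practice they would likely come from a library.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0):
    """Real (cosine) part of a 2-D Gabor filter as a size x size nested list.

    All parameter values used below are illustrative assumptions.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier wave oscillates along theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    # Subtract the mean so that a uniform (textureless) area gives zero response.
    mean = sum(sum(row) for row in kernel) / (size * size)
    return [[v - mean for v in row] for row in kernel]

def response_at(image, kernel, cy, cx):
    """Filter response at pixel (cy, cx); the kernel must fit inside the image."""
    half = len(kernel) // 2
    total = 0.0
    for ky, row in enumerate(kernel):
        for kx, weight in enumerate(row):
            total += weight * image[cy + ky - half][cx + kx - half]
    return total

def region_features(image, region, kernels):
    """One feature per kernel: mean absolute response over the region's pixels."""
    features = []
    for kernel in kernels:
        values = [abs(response_at(image, kernel, y, x)) for (y, x) in region]
        features.append(sum(values) / len(values))
    return features

# Synthetic 17x17 patch with vertical stripes (period of 4 pixels along x).
image = [[1.0 if (x // 2) % 2 == 0 else 0.0 for x in range(17)] for y in range(17)]
# A hypothetical region: a 5x5 block of pixels standing in for one superpixel.
region = [(y, x) for y in range(6, 11) for x in range(6, 11)]
# Two orientations: carrier along x (theta = 0) and along y (theta = pi/2).
kernels = [gabor_kernel(9, 4.0, theta, 2.0) for theta in (0.0, math.pi / 2)]
features = region_features(image, region, kernels)
print(features)
```

On this striped patch, the filter whose carrier runs across the stripes responds far more strongly than the perpendicular one, which is the kind of orientation-sensitive signal a texture descriptor is meant to capture. In a full pipeline of the sort the abstract describes, one such feature vector per superpixel would then be passed to a classifier.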
