Text Extraction for Historical Tibetan Document Images Based on Connected Component Analysis and Corner Point Detection

In this paper, we present a text extraction method for historical Tibetan document images. Text extraction is treated as a text area detection and localization problem. First, the historical Tibetan document image is preprocessed to correct uneven illumination, skew, and noise, and is then binarized. Second, the regions of interest in the document are divided into three categories using connected components (CCs). The image is divided into equal-sized grid cells, and the cells are filtered using the CC categories and the corner point density. The remaining cells are used to compute vertical and horizontal grid projections. Third, the approximate location of the text area is detected by analyzing these projections. Finally, the text area is extracted accurately by correcting the bounding box of the approximate text area. Experiments on a dataset of historical Tibetan document images demonstrate the effectiveness of the proposed method.
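The grid filtering and projection steps can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes corner points have already been detected and are given as (x, y) pixel coordinates, it filters cells by corner point count only (the CC category filter is omitted), and it skips the final bounding box correction. All function names, the cell size, and the thresholds are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates of a detected corner point


def grid_counts(points: List[Point], width: int, height: int, cell: int) -> List[List[int]]:
    """Count the corner points that fall in each cell of a uniform grid."""
    rows = (height + cell - 1) // cell
    cols = (width + cell - 1) // cell
    counts = [[0] * cols for _ in range(rows)]
    for x, y in points:
        if 0 <= x < width and 0 <= y < height:
            counts[y // cell][x // cell] += 1
    return counts


def grid_profiles(kept: List[List[bool]]) -> Tuple[List[int], List[int]]:
    """Project the kept cells onto the grid rows and the grid columns."""
    row_profile = [sum(row) for row in kept]        # kept cells per grid row
    col_profile = [sum(col) for col in zip(*kept)]  # kept cells per grid column
    return row_profile, col_profile


def span(profile: List[int], min_cells: int) -> Optional[Tuple[int, int]]:
    """First and last index whose projection value reaches min_cells."""
    idx = [i for i, v in enumerate(profile) if v >= min_cells]
    return (idx[0], idx[-1]) if idx else None


def approximate_text_box(
    points: List[Point],
    width: int,
    height: int,
    cell: int = 10,       # assumed cell size in pixels
    min_points: int = 2,  # assumed corner count needed to keep a cell
    min_cells: int = 1,   # assumed projection threshold
) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) of the approximate text area, or None."""
    counts = grid_counts(points, width, height, cell)
    # Keep only the cells that are dense enough in corner points.
    kept = [[c >= min_points for c in row] for row in counts]
    row_profile, col_profile = grid_profiles(kept)
    row_span = span(row_profile, min_cells)
    col_span = span(col_profile, min_cells)
    if row_span is None or col_span is None:
        return None
    left = col_span[0] * cell
    right = min((col_span[1] + 1) * cell, width)
    top = row_span[0] * cell
    bottom = min((row_span[1] + 1) * cell, height)
    return left, top, right, bottom
```

For example, with four corner points clustered in one region and one isolated point elsewhere, `approximate_text_box([(21, 11), (25, 15), (31, 12), (35, 18), (95, 95)], 100, 100)` returns `(20, 10, 40, 20)`: the cell containing the isolated point holds too few corners and is discarded before the projections are computed.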
