Automatic segmentation and reconstruction of historical manuscripts in gradient domain

Separating content from noise in historical manuscripts is a fundamental task in digital palaeography. This study presents a fully automated segmentation approach based on the response of the Harris corner detector. The strength and clustering efficiency of the detected corners are evaluated and used to separate the content from the background and noise. In addition, a manuscript reconstruction technique is proposed that rebuilds the image from its gradient field, using the Poisson method to guide the interpolation. This reconstruction removes a significant amount of noise and thereby enhances the contrast of the content, making the documents easier for users to read and process. The proposed approaches are evaluated on several standard databases to assess their effectiveness and their robustness to a wide range of noise types and writing styles. Subjective and objective evaluations of the experimental results show that the techniques successfully segment and reconstruct a very diverse set of scanned documents. An analysis of the results also shows that the proposed techniques compare favourably with similar existing methods.
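
To make the segmentation idea concrete, the following is a minimal sketch of a Harris corner-strength map followed by a simple relative threshold. It is an illustration under stated assumptions, not the method evaluated in the study: the function names (`box_filter`, `harris_response`, `content_mask`), the box-shaped averaging window, the constant `k = 0.05` and the threshold `rel_thresh = 0.1` are all hypothetical choices, and the fixed threshold only stands in for the corner strength and clustering analysis described above.

```python
import numpy as np


def box_filter(a, radius):
    """Mean of `a` over a (2*radius+1)^2 window, with edge padding."""
    k = 2 * radius + 1
    p = np.pad(a, radius, mode="edge")
    # Integral image with a leading row and column of zeros.
    s = np.zeros((p.shape[0] + 1, p.shape[1] + 1))
    s[1:, 1:] = p.cumsum(axis=0).cumsum(axis=1)
    h, w = a.shape
    total = s[k:k + h, k:k + w] - s[:h, k:k + w] - s[k:k + h, :w] + s[:h, :w]
    return total / (k * k)


def harris_response(img, radius=2, k=0.05):
    """Harris corner strength R = det(M) - k * trace(M)^2 per pixel.

    M is the structure tensor of the image gradients, averaged over a
    box window (an assumption; a Gaussian window is also common).
    """
    img = np.asarray(img, dtype=float)
    gy, gx = np.gradient(img)  # derivatives along rows, then columns
    sxx = box_filter(gx * gx, radius)
    syy = box_filter(gy * gy, radius)
    sxy = box_filter(gx * gy, radius)
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    return det - k * trace ** 2


def content_mask(img, rel_thresh=0.1, **kwargs):
    """Hypothetical stand-in for the segmentation step: keep pixels whose
    corner strength exceeds a fraction of the maximum response."""
    r = harris_response(img, **kwargs)
    return r > rel_thresh * r.max()
```

On a synthetic page with a dark square on a bright background, this response is positive near the corners of the square, negative along its straight edges and zero in flat regions, so the mask keeps only corner-rich areas. Handwritten strokes produce dense corner responses, which is the property a corner-based segmentation relies on.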
