High Performance Layout Analysis of Medieval European Document Images

Layout analysis, mainly including binarization and page segmentation, is one of the most important performance determining steps of an OCR system for complex medieval document images, which contain noise, distortions and irregular layouts. In this paper, we present high performance page segmentation techniques for medieval European document images which include a novel main-body and side-notes segregation and an improved version of OCRopus (OCRopus, ) based text line extraction. In order to complete the high performance layout analysis pipeline, we have also presented the application of the percentile based binarization (Afzal et al., 2014) and the multiresolution morphology based text and non-text segmentation (Bukhari et al., 2011) methods over historical document images. presented layout analysis techniques are applied to a collection of the 15th century Latin document images, which achieved more than 90% accuracy for each of the segmentation techniques.

[1]  Thomas M. Breuel,et al.  Performance Evaluation and Benchmarking of Six-Page Segmentation Algorithms , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Friedrich M. Wahl,et al.  Document Analysis System , 1982, IBM J. Res. Dev..

[3]  Dan S. Bloomberg,et al.  Multiresolution Morphological Approach to Document Image Analysis , 1991 .

[4]  C. V. Jawahar,et al.  On Segmentation of Documents in Complex Scripts , 2007 .

[5]  Mahesh Viswanathan,et al.  A prototype document image analysis system for technical journals , 1992, Computer.

[6]  Motoi Iwata,et al.  Segmentation of Page Images Using the Area Voronoi Diagram , 1998, Comput. Vis. Image Underst..

[7]  Henry S. Baird Background Structure in Document Images , 1994, Int. J. Pattern Recognit. Artif. Intell..

[8]  Syed Saqib Bukhari,et al.  Improved document image segmentation algorithm using multiresolution morphology , 2011, Electronic Imaging.

[9]  Thomas M. Breuel,et al.  Two Geometric Algorithms for Layout Analysis , 2002, Document Analysis Systems.

[10]  George Nagy,et al.  Twenty Years of Document Image Analysis in PAMI , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[11]  Syed Saqib Bukhari,et al.  Robust Binarization of Stereo and Monocular Document Images Using Percentile Filter , 2013, CBDAR.

[12]  Stefano Messelodi,et al.  Geometric Layout Analysis Techniques for Document Image Understanding: a Review , 2008 .

[13]  Lawrence O'Gorman,et al.  The Document Spectrum for Page Layout Analysis , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[14]  Syed Saqib Bukhari,et al.  Document image segmentation using discriminative learning over connected components , 2010, DAS '10.