Layout Analysis for Arabic Historical Document Images Using Machine Learning

Page layout analysis is a fundamental step in any document image understanding system. We introduce an approach that segments text appearing in page margins (also known as side-note text) from manuscripts with complex layouts. Simple, discriminative features are extracted at the connected-component level, and robust feature vectors are then generated from them. A multilayer perceptron classifier assigns each connected component to the relevant text class. A voting scheme is then applied to refine the resulting segmentation and produce the final classification. In contrast to state-of-the-art segmentation approaches, this method is independent of both block segmentation and pixel-level analysis. The proposed method was trained and tested on a dataset that contains a variety of complex side-note layout formats, achieving a segmentation accuracy of about 95%.
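The voting refinement step can be illustrated with a minimal sketch. The abstract does not specify how votes are collected, so the following assumes a simple nearest-neighbour majority vote over component centroids; the function name `refine_labels_by_voting`, the parameter `k`, and the labels "main" and "side" are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import hypot


def refine_labels_by_voting(centroids, labels, k=3):
    """Relabel each connected component by a majority vote among its own
    predicted label and the labels of its k nearest components.

    Assumption: neighbours are chosen by Euclidean distance between
    component centroids. The paper's actual voting scheme may differ.
    """
    refined = []
    for i, (xi, yi) in enumerate(centroids):
        # Indices of the k closest other components.
        neighbours = sorted(
            (j for j in range(len(centroids)) if j != i),
            key=lambda j: hypot(centroids[j][0] - xi, centroids[j][1] - yi),
        )[:k]
        votes = Counter(labels[j] for j in neighbours)
        votes[labels[i]] += 1  # the classifier's own prediction also votes
        top = max(votes.values())
        winners = [label for label, count in votes.items() if count == top]
        # On a tie that includes the original label, keep the original label.
        refined.append(labels[i] if labels[i] in winners else winners[0])
    return refined


# Hypothetical page: four main-text components on the left, four side-note
# components in the right margin, with one main-text component mislabelled.
centroids = [(100, 100), (110, 100), (120, 100), (130, 100),
             (500, 100), (510, 100), (520, 100), (530, 100)]
predicted = ["main", "main", "side", "main", "side", "side", "side", "side"]
print(refine_labels_by_voting(centroids, predicted, k=3))
```

In this toy case the isolated "side" prediction inside the main-text cluster is outvoted by its neighbours, which is the kind of local inconsistency a voting stage is meant to remove.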
