Adaptive Thresholding and Geometric Features based Physical Layout Analysis of Scanned Arabic Books

In the digital age, developing an automated system to convert old printed books into digital form remains a challenging task. In this paper we propose a novel technique for the recognition of scanned Arabic documents with both normal and complex layouts. The proposed algorithm is based on local adaptive thresholding and geometric features; to the best of the authors' knowledge, this is the first time this combination has been applied to Arabic document image recognition based on Physical Layout Analysis (PLA). The method was applied to a dataset of 90 images collected from 700 books from various publishers, containing a total of 1112 zones of three types: text, image, and graphic. The algorithm achieved promising results, with an overall average recognition rate of 86.71% for text and image block regions across all three sets. It outperforms the techniques reported in the previous literature.
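The two ingredients named in the abstract can be illustrated with a minimal sketch. The abstract does not give the thresholding rule or the feature set, so everything below is an assumption chosen for illustration: a local-mean threshold over a square window (the `window` and `offset` parameters are hypothetical), and bounding-box width, height, aspect ratio, and ink density as example geometric features. It is not the authors' implementation.

```python
import numpy as np

def local_mean_threshold(gray, window=15, offset=10):
    """Mark a pixel as foreground when it is darker than the mean of its
    window x window neighbourhood minus `offset` (assumed rule, odd window)."""
    gray = np.asarray(gray, dtype=np.float64)
    pad = window // 2
    padded = np.pad(gray, pad, mode="edge")
    # Integral image with a leading zero row and column, so that each
    # window sum can be read off with four lookups.
    integral = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    integral[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = gray.shape
    window_sum = (integral[window:window + h, window:window + w]
                  - integral[:h, window:window + w]
                  - integral[window:window + h, :w]
                  + integral[:h, :w])
    local_mean = window_sum / (window * window)
    return gray < local_mean - offset

def block_features(mask):
    """Geometric features of the foreground bounding box in a boolean mask
    (illustrative feature set, not taken from the paper)."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return None  # no foreground pixels at all
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    height = int(bottom - top + 1)
    width = int(right - left + 1)
    box = mask[top:bottom + 1, left:right + 1]
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        "density": float(box.mean()),  # fraction of foreground pixels in the box
    }

# Synthetic page: white background with one dark block.
page = np.full((40, 60), 255.0)
page[10:20, 20:50] = 0.0
features = block_features(local_mean_threshold(page))
```

In a layout-analysis setting, features of this kind could feed a rule or classifier that separates text zones from image and graphic zones; how the paper actually does this is not stated in the abstract.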
