BCE-Arabic-v1 dataset: Towards interpreting Arabic document images for people with visual impairments

Millions of individuals in the Arab world have significant visual impairments that make it difficult for them to access printed text. Assistive technologies such as scanners and screen readers often fail to turn text into speech because optical character recognition (OCR) software has difficulty interpreting the textual content of Arabic documents. In this paper, we show that the inaccessibility of scanned PDF documents is in large part due to the failure of the OCR engine to understand the layout of an Arabic document. Arabic document layout analysis (DLA) is therefore an urgent research topic, motivated by the goal of providing assistive technology that serves people with visual impairments. We announce the launch of a large annotated dataset of Arabic document images, called BCE-Arabic-v1, to be used as a benchmark for DLA, OCR, and text-to-speech research. Our dataset contains 1,833 images of pages scanned from 180 books and represents a variety of page content and layouts, in particular Arabic text in various fonts and sizes, photographs, tables, diagrams, and charts, in single or multiple columns. We report the results of a formative study that investigated the performance of state-of-the-art document annotation tools. We found significant differences and limitations in the functionality and labeling speed of these tools, and selected the best-performing tool for annotating our benchmark BCE-Arabic-v1.
