A Quranic Dataset for Text Recognition

Any text recognition or Optical Character Recognition (OCR) system requires a dataset from which to learn to recognize text. Due to the lack of a standard benchmark, most studies in this field have been conducted on private datasets, which prevents a fair comparison. In this work, we use the standard Mushaf al Madinah as a benchmark. It follows certain rules of writing style; for example, each page should start at the beginning of a verse and end at the end of a verse. Following these rules makes the size of the words and the paragraphs vary from page to page. These characteristics make recognizing Quranic text more challenging than recognizing ordinary Arabic text, and state-of-the-art systems fail to recognize Quranic text. Therefore, a Quranic OCR dataset is presented in this study. It contains 604 page-level images and 8927 text-line-level images. The dataset is public and free to use for research purposes, and is intended to help researchers in the field of Arabic OCR.
