Semi-automated OCR database generation for Nabataean scripts

Training and benchmarking any character recognition algorithm requires a large amount of real-world data. Developing a page-level ground-truth database for this purpose is overwhelmingly laborious, because producing a reasonable database that covers all possible words of a language involves considerable manual effort. Generating such a database for historical (degraded) documents or for a cursive script like Urdu is even more complex and grueling. The presented work attempts to solve this problem by proposing a semi-automated technique for generating a ground-truth database. The proposed automation is expected to greatly reduce the manual effort needed to develop any OCR database. The basic idea is to apply ligature clustering prior to manual labeling. Two prototype datasets for Urdu script have been developed using the proposed technique, and the results are presented.
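As an illustration of clustering ligatures before manual labeling, the following is a minimal sketch and not the authors' implementation. It assumes ligatures have already been extracted as small binary images. The grid-density feature, the greedy leader clustering, the distance threshold, and all function names (`features`, `cluster_ligatures`, `label_clusters`) are hypothetical choices made for this example. The point it shows is that a human labels one representative per cluster and the label is propagated to every member.

```python
from typing import Callable, Dict, List, Sequence

Image = Sequence[Sequence[int]]  # binary ligature image: 1 = ink, 0 = background


def features(img: Image, grid: int = 4) -> List[float]:
    """Ink density in each cell of a grid x grid partition (assumed feature)."""
    h, w = len(img), len(img[0])
    feats: List[float] = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * h // grid, (gy + 1) * h // grid
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            cells = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feats.append(sum(cells) / len(cells) if cells else 0.0)
    return feats


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cluster_ligatures(images: Sequence[Image], threshold: float = 1.5) -> List[List[int]]:
    """Greedy leader clustering: each image joins the first cluster whose
    leader (first member) is within `threshold`, else starts a new cluster."""
    leaders: List[List[float]] = []
    clusters: List[List[int]] = []
    for idx, img in enumerate(images):
        f = features(img)
        for c, leader in enumerate(leaders):
            if distance(f, leader) <= threshold:
                clusters[c].append(idx)
                break
        else:
            leaders.append(f)
            clusters.append([idx])
    return clusters


def label_clusters(clusters: Sequence[Sequence[int]],
                   ask_label: Callable[[int], str]) -> Dict[int, str]:
    """Ask for one label per cluster (for its first member) and
    propagate that label to all members of the cluster."""
    labels: Dict[int, str] = {}
    for members in clusters:
        label = ask_label(members[0])
        for idx in members:
            labels[idx] = label
    return labels
```

In this sketch `ask_label` stands in for the manual step: it is called once per cluster rather than once per ligature, which is where the saving in manual effort would come from. A real pipeline would also need a verification pass, since any clustering error propagates a wrong label to every member of the cluster.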
