A Structure for Annotation and Ground-truthing of Urdu Handwritten Text Image Corpus

Abstract Over the last few decades, the field of handwriting recognition has advanced considerably. With the current shift toward digital electronics, handwritten documents are becoming scarcer. However, investigation and research on a particular language require a large database of handwritten documents. In this paper we describe our approach to developing a large corpus of Urdu handwritten text images. To make the database usable across the broad field of Natural Language Processing, we annotate each image and associate it with XML-based ground-truth meta-information, making it machine-readable as a linguistic resource. This paper focuses on issues related to corpus design and annotation, such as data collection, writer selection, and annotation methodology.
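
As a rough illustration of what XML-based ground truth attached to a page image might look like, the following minimal sketch builds one record with Python's standard library. The element and attribute names (`page`, `line`, `image`, `writer`, `n`), the file name, and the writer identifier are all hypothetical placeholders; the abstract does not specify the actual schema used for the corpus.

```python
import xml.etree.ElementTree as ET


def build_ground_truth(image_file, writer_id, lines):
    """Build a ground-truth XML record for one handwritten page image.

    The schema is an assumption made for illustration only:
    a <page> root carrying the image file name and writer id,
    with one <line> child per transcribed text line.
    """
    root = ET.Element("page", attrib={"image": image_file, "writer": writer_id})
    for number, text in enumerate(lines, start=1):
        line = ET.SubElement(root, "line", attrib={"n": str(number)})
        line.text = text  # Unicode transcription of the handwritten line
    return root


# Hypothetical example values, not taken from the corpus.
record = build_ground_truth(
    "form_0001.png", "w001", ["یہ ایک مثال ہے", "اردو متن"]
)

# encoding="unicode" returns a str and keeps the Urdu text readable.
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Keeping the transcription as Unicode text inside the XML, rather than as escaped character codes, is one way to keep such a record directly usable by downstream language-processing tools.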
