Experiments on Urdu Text Recognition

Urdu is a language spoken in the Indian subcontinent by an estimated 130–270 million speakers. At the spoken level, Urdu and Hindi are considered dialects of a single language because of their shared vocabulary and similar grammar. At the written level, however, Urdu is much closer to Arabic because it is written in Nastaliq, a calligraphic style of the Persian–Arabic script. A speaker of Hindi can therefore understand spoken Urdu but may not be able to read written Urdu, because Hindi is written in the Devanagari script, whereas a reader of Arabic can read the written words but may not understand spoken Urdu. In this chapter we present an overview of written Urdu. Prior research on handwritten Urdu OCR is very limited. We present what is perhaps the first system for recognizing handwritten Urdu words. On a data set of about 1300 handwritten words, the system achieved an accuracy of 70% for the top choice and 82% for the top three choices.
