Implementation Challenges for Nastaliq Character Recognition

Character recognition for cursive scripts and for handwritten Latin script has recently attracted researchers' attention, and some work has been done in this area. Optical character recognition (OCR) is the translation of optically scanned bitmaps of printed or written text into digitally editable data files. OCR systems developed for many of the world's languages are already in use, but none exists for Urdu Nastaliq, a calligraphic adaptation of the Arabic script, just as Jawi is for Malay. Urdu Nastaliq has 39 characters against Arabic's 28. Each character has two to four different shapes according to its position in the word: initial, medial, final and isolated. In Nastaliq, inter-word and intra-word overlapping makes optical recognition more complex; character recognition of the Latin script is easier by comparison. This paper reports research on Urdu Nastaliq OCR, discusses the challenges and suggests a new solution for its implementation.
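
The position-dependent shapes mentioned above can be illustrated with a minimal sketch. It is not the paper's method: it only looks up the isolated, final, initial and medial glyphs that Unicode's Arabic Presentation Forms blocks record for a letter, using Python's standard unicodedata module. The function name positional_forms is hypothetical, and these code points describe generic Arabic-script shaping, not Nastaliq-specific rendering.

```python
import unicodedata

# Position names as they appear in Unicode decomposition tags.
POSITIONS = ("isolated", "final", "initial", "medial")


def positional_forms(letter: str) -> dict[str, str]:
    """Return {position: glyph} for one Arabic-script letter.

    Hypothetical helper: scans the Arabic Presentation Forms blocks for
    code points whose compatibility decomposition is "<position> LETTER".
    """
    target = ord(letter)
    forms = {}
    for cp in range(0xFB50, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # A single-letter positional form looks like "<initial> 0628".
        if len(parts) != 2:
            continue
        tag = parts[0].strip("<>")
        if tag in POSITIONS and int(parts[1], 16) == target:
            forms[tag] = chr(cp)
    return forms


if __name__ == "__main__":
    # BEH joins on both sides; ALEF joins only to the preceding letter.
    for letter in ("\u0628", "\u0627", "\u0679"):
        shapes = positional_forms(letter)
        print(unicodedata.name(letter), sorted(shapes))
```

A letter that joins on both sides, such as BEH (U+0628) or the Urdu letter TTEH (U+0679), yields four entries, while a letter that joins only to the preceding letter, such as ALEF (U+0627), yields two. A recognizer therefore has to treat several distinct glyph shapes as the same character class.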
