Improving Nastalique specific pre-recognition process for Urdu OCR

Urdu is written in the Arabic script using the Nastalique writing style. Nastalique is highly cursive and context sensitive, and it is hard to process because only the last character of a ligature sits on the baseline. It also exhibits spatial overlap at both the character and the ligature level. As a result, the placement of dots and other diacritics is highly contextual and variable. A growing body of work addresses the processing and recognition of Nastalique script for Urdu OCR, and this paper proposes improvements to those methods. It focuses on Nastalique-specific pre-processing methods that can be applied before text recognition; the recognition and post-recognition processes will be addressed separately.
