A Holistic Approach for Recognition of Complete Urdu Ligatures Using Hidden Markov Models

Optical Character Recognition (OCR) remains an actively studied problem. Commercial character recognizers now report recognition rates close to 100% on text in a number of scripts. Despite these advances, OCR systems have yet to mature for cursive scripts such as Urdu. This study presents a holistic technique for recognizing Urdu text in the Nastaliq font using "complete" ligatures as recognition units. Here, a "complete" ligature is a partial word comprising its main body and its secondary components (dots and diacritic marks). The Discrete Wavelet Transform (DWT) is employed as the feature extractor, and a separate Hidden Markov Model (HMM) is trained for each ligature considered in our study. More than 2000 frequently used unique Urdu ligatures from the standard CLE (Center of Language Engineering) dataset are considered in our evaluations. The system achieves a promising accuracy of 88.87% on more than 10,000 partial words.
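
The two building blocks named in the abstract, DWT features and one HMM per ligature, can be illustrated with a minimal sketch. It is not the implementation evaluated in the study. The Haar wavelet, the discrete-observation HMMs, the toy parameters, and all names (`haar_dwt_1d`, `log_likelihood`, `classify`, `ligature_A`, `ligature_B`) are assumptions made for illustration. Training the HMMs and quantizing DWT coefficients into observation symbols are omitted.

```python
import numpy as np

def haar_dwt_1d(signal):
    """One level of the Haar DWT: returns (approximation, detail) coefficients."""
    x = np.asarray(signal, dtype=float)
    if x.size % 2:                       # pad odd-length input by repeating the last sample
        x = np.append(x, x[-1])
    pairs = x.reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)
    return approx, detail

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm for a discrete HMM: log P(obs | model).

    Assumes every symbol in obs has non-zero probability under the model.
    """
    alpha = start * emit[:, obs[0]]
    scale = alpha.sum()
    log_p = np.log(scale)
    alpha = alpha / scale
    for symbol in obs[1:]:
        alpha = (alpha @ trans) * emit[:, symbol]
        scale = alpha.sum()
        log_p += np.log(scale)
        alpha = alpha / scale
    return log_p

def classify(obs, models):
    """Return the label whose HMM gives the observation sequence the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(obs, *models[label]))

# Hypothetical toy models: two states, two observation symbols each.
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
models = {
    "ligature_A": (start, trans, np.array([[0.9, 0.1], [0.8, 0.2]])),  # mostly emits symbol 0
    "ligature_B": (start, trans, np.array([[0.1, 0.9], [0.2, 0.8]])),  # mostly emits symbol 1
}

approx, detail = haar_dwt_1d([1, 1, 3, 3])
label = classify([0, 0, 0, 1, 0], models)
```

With one model per ligature, recognition amounts to scoring the observation sequence against every model and keeping the best-scoring label, so the cost grows linearly with the number of ligature classes.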
