Urdu ligature recognition using multi-level agglomerative hierarchical clustering

Optical character recognition (OCR) systems hold great significance in human-machine interaction. OCR has been the subject of intensive research, especially for Latin, Chinese and Japanese scripts. Comparatively little work has been done on Urdu OCR, owing to the complexities and segmentation errors associated with its cursive script. This paper proposes an Urdu OCR system that recognizes Urdu text at the ligature level. This ligature-based approach avoids the character-level segmentation problems associated with cursive scripts. A newly developed OCR algorithm is introduced that uses semi-supervised multi-level clustering to categorize the ligatures. Classification is performed using four machine learning techniques: decision trees, linear discriminant analysis, naive Bayes and k-nearest neighbor (k-NN). In the implemented system, decision trees, linear discriminant analysis, naive Bayes and k-NN achieve accuracies of 62%, 61%, 73% and 90%, respectively.
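The following is a minimal sketch of how multi-level agglomerative clustering followed by k-NN classification might be arranged. It is not the authors' implementation: the abstract does not state the features, linkage criterion or distance measure, so the toy feature vectors (a dot count, a width and a height), the average linkage, the Euclidean distance and all function names here are illustrative assumptions.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters (ties go to the lowest indices).
        _, a, b = min((linkage(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def multilevel_clusters(features, levels):
    """Apply clustering coarse-to-fine; levels is a list of (feature_indices, k)."""
    groups = [list(range(len(features)))]
    for idxs, k in levels:
        next_groups = []
        for g in groups:
            sub = [[features[i][d] for d in idxs] for i in g]
            for c in agglomerative(sub, min(k, len(g))):
                next_groups.append([g[j] for j in c])
        groups = next_groups
    return groups

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training samples."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: euclidean(train_x[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Hypothetical ligature features: (dot count, width, height).
features = [
    (0, 1.0, 1.0), (0, 1.1, 0.9), (0, 5.0, 5.0), (0, 5.2, 4.9),
    (2, 1.0, 1.0), (2, 0.9, 1.1), (2, 5.0, 5.0), (2, 5.1, 5.1),
]
# Level 1 splits on dot count, level 2 on width and height (assumed ordering).
groups = multilevel_clusters(features, [((0,), 2), ((1, 2), 2)])

# Use the cluster index as the class label; in a semi-supervised setting a
# person would review the clusters and attach a ligature label to each.
labels = [None] * len(features)
for cluster_id, members in enumerate(groups):
    for i in members:
        labels[i] = cluster_id

predicted = knn_predict(features, labels, (2, 5.05, 5.0), k=3)
print(groups, predicted)
```

Clustering first on a cheap, coarse feature and only then on finer shape features keeps each agglomerative pass small, which is one plausible motivation for a multi-level arrangement. The resulting cluster labels could be passed to any of the four classifiers named in the abstract; k-NN is shown only because it is the simplest to write without external libraries.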
