Urdu text classification using decision trees

This article reports the development and experimental analysis of an Urdu Optical Character REcognition (OCR) system. The proposed approach presents the preprocessing, features extraction and classification of Urdu language text. Three different features extraction techniques, the Hu moments, Zernike moments and the Principal Component Analysis (PCA) are used. Decision Tree algorithm J-48 is used for classification. A medium size database of 441 characters is created consisting of hand written and machine written Urdu language characters. An overall best recognition accuracy of 92.06% is achieved using the Hu moments.

[1]  Yoshihiro Shima,et al.  Optimal techniques in OCR error correction for Japanese texts , 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[2]  Mohammad S. Khorsheed,et al.  Off-Line Arabic Character Recognition – A Review , 2002, Pattern Analysis & Applications.

[3]  S. A. Husain A multi-tier holistic approach for Urdu Nastaliq recognition , 2002 .

[4]  Thomas H. Reiss,et al.  The revised Fundamental Theorem of Moment Invariants , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[5]  Srikanta Patnaik,et al.  Optical Character Recognition System for Urdu (Naskh Font) Using Pattern Matching Technique , 2009 .

[6]  J. Ross Quinlan,et al.  Decision trees and decision-making , 1990, IEEE Trans. Syst. Man Cybern..

[7]  Sarmad Hussain,et al.  Improving Nastalique specific pre-recognition process for Urdu OCR , 2009, 2009 IEEE 13th International Multitopic Conference.

[8]  U. Pal,et al.  Recognition of printed Urdu script , 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings..

[9]  Nadir Durrani,et al.  Urdu Word Segmentation , 2010, NAACL.

[10]  David A. Landgrebe,et al.  A survey of decision tree classifier methodology , 1991, IEEE Trans. Syst. Man Cybern..

[11]  Anil K. Jain,et al.  Feature extraction methods for character recognition-A survey , 1996, Pattern Recognit..

[12]  Ming-Kuei Hu,et al.  Visual pattern recognition by moment invariants , 1962, IRE Trans. Inf. Theory.

[13]  อนิรุธ สืบสิงห์,et al.  Data Mining Practical Machine Learning Tools and Techniques , 2014 .

[14]  Masaki Nakagawa,et al.  'Online recognition of Chinese characters: the state-of-the-art , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  R. Mukundan,et al.  Moment Functions in Image Analysis: Theory and Applications , 1998 .

[16]  Awais Adnan,et al.  OCR For Printed Urdu Script Using Feed Forward Neural Network , 2007 .

[17]  Alex Pentland,et al.  Face recognition using eigenfaces , 1991, Proceedings. 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[18]  D. Menotti,et al.  Hu and Zernike Moments for Sign Language Recognition , 2012 .

[19]  Shehzad Khalid,et al.  Recognition of Urdu ligatures - a holistic approach , 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).