Urdu Optical Character Recognition Systems: Present Contributions and Future Directions

This paper presents a comprehensive survey of the most prominent studies in Urdu optical character recognition (OCR). It introduces OCR technology and gives a historical review of OCR systems, comparing English, Arabic, and Urdu systems. It also provides detailed background and literature on the Urdu script, covering the script's history and the categories and phases of OCR. The paper then reports the state-of-the-art studies for each phase of an Urdu OCR system, namely image acquisition, pre-processing, segmentation, feature extraction, classification/recognition, and post-processing. The segmentation section emphasizes the analytical and holistic approaches for Urdu text. The feature extraction section compares the feature learning and feature engineering approaches. Both deep learning and traditional machine learning approaches are discussed. Urdu numeral recognition systems are also discussed briefly. The paper concludes by identifying open problems and suggesting future directions.
