Segmentation techniques for recognition of Arabic-like scripts: A comprehensive survey

Recognition of Arabic script based text has been a popular field of research for many years, and such systems can support learning and teaching by helping students and educators read and understand educational content written in Arabic script. The challenging nature of Arabic script recognition has attracted researchers from both industry and academia, but these efforts have not yet achieved good results. Segmenting Urdu script written in the Nasta’liq style is a very difficult task because the style is more complex than Naskh. Good segmentation is one of the prerequisites for high accuracy, and character segmentation is a critical phase of the OCR process. The higher recognition rates reported for isolated characters, compared with those for words or connected characters, illustrate the importance of segmentation. This study surveys recent work on character segmentation and the challenges of segmentation for Arabic script based languages.
