A Comparative Analysis for Identification and Classification of Text Segmentation Challenges in Takri Script

Takri is a regional class of Indian scripts used in the hilly areas of north-west India, including Jammu and Kashmir (J&K), Himachal Pradesh (H.P.), Punjab, and Uttarakhand. The script varies widely: about 13 variants have been identified across the region. No work on text identification and recognition of Takri script appears to have been done so far. Our work therefore focuses on identifying and classifying the challenges the script poses, based on a comparative analysis of existing text segmentation approaches, since correct segmentation of text leads to more accurate machine recognition. Because no metal fonts were available for the script, machine-printed data had to be collected to address the text identification problem. The paper surveys different text segmentation approaches and, based on the structural properties of the script, applies them to Takri data in three steps: the Gurmukhi segmentation technique, the connected-component segmentation approach, and the Gurmukhi touching-character segmentation approach. The results are analyzed for segmentation accuracy, and the challenges are identified along with their statistical analysis. The identified challenges, namely half-forms, numerous types of touching characters, and overlapping bounding boxes, are then classified. The effectiveness of this classification was evaluated using the Naïve Bayes machine learning algorithm. The results showed 80% accuracy in text identification and classification of Takri script.
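As a rough illustration of the connected-component step and of the overlapping-bounding-box challenge named above, the following is a minimal sketch and not the authors' implementation. It assumes a binarized word image stored as a list of rows (1 = ink, 0 = background) and 8-connectivity. The names `connected_components`, `overlapping_pairs`, and `IMAGE` are hypothetical and chosen only for this example.

```python
from collections import deque


def connected_components(image):
    """Return bounding boxes (top, left, bottom, right) of 8-connected ink regions,
    sorted left to right. Assumes image is a non-empty list of equal-length rows."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return sorted(boxes, key=lambda box: box[1])


def overlapping_pairs(boxes):
    """Return index pairs of boxes whose column ranges overlap.
    Such pairs cannot be separated by a single vertical cut."""
    pairs = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            a, b = boxes[i], boxes[j]
            if a[1] <= b[3] and b[1] <= a[3]:
                pairs.append((i, j))
    return pairs


# Toy image (an assumption for illustration): three separate ink regions,
# the last two sharing column 5 without touching each other.
IMAGE = [
    [1, 1, 0, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1],
]

boxes = connected_components(IMAGE)
print(boxes)
print(overlapping_pairs(boxes))
```

In this toy case the second and third components are distinct regions, yet their boxes share a column, which is the kind of case that column-projection segmentation alone would merge or cut incorrectly.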
