A Performance Comparison of Segmentation Techniques for the Urdu Text

Segmentation is the process of splitting the contents of a document image into subparts (i.e., lines and words). It is an indispensable first step for common language-processing applications such as document structure extraction, content reconstruction, optical character recognition, forgery detection, security, and graphology. This paper presents a quantitative performance comparison of three text line segmentation techniques for type-written Urdu Nastaleeq text: the projection method, the smearing method, and the edge information-based method. The evaluation uses precision and recall metrics over a collected set of data samples taken from magazines, poetry books, and newspapers. The strengths and weaknesses of each algorithm are analyzed, and the smearing method is observed to outperform the other two.
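
As an illustration of the simplest of the three techniques, the following is a minimal sketch of projection-based line segmentation together with a precision and recall computation. It is not the implementation evaluated in the paper. The function names, the `min_ink` threshold, and the exact-match criterion for counting a detected line as correct are all assumptions made for this example.

```python
from typing import List, Tuple

Line = Tuple[int, int]  # (top_row, bottom_row), both inclusive


def segment_lines(image: List[List[int]], min_ink: int = 1) -> List[Line]:
    """Split a binary image (1 = ink, 0 = background) into text lines.

    min_ink is a hypothetical threshold: rows with fewer ink pixels
    than this are treated as gaps between lines.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]

    lines: List[Line] = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y - 1))   # the text line ended on the previous row
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


def precision_recall(detected: List[Line], truth: List[Line]) -> Tuple[float, float]:
    """Precision and recall, assuming a line is correct only on an exact match."""
    matched = len(set(detected) & set(truth))
    precision = matched / len(detected) if detected else 0.0
    recall = matched / len(truth) if truth else 0.0
    return precision, recall


# Toy 6x4 page with two text lines separated by an empty row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
found = segment_lines(page)
print(found)
print(precision_recall(found, [(1, 2), (4, 5)]))
```

An evaluation on real scans would typically replace the exact-match rule with an overlap-based criterion. The strict rule is used here only to keep the sketch short.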
