Techniques for Text, Line and Word Segmentation

Handwritten text recognition has been one of the most challenging tasks for decades, and it plays an important role in document image processing. Text line segmentation is a critical step, because the performance of Optical Character Recognition (OCR) depends on the quality of the segmented input. Many methods exist for line segmentation, word segmentation and character segmentation. This paper surveys the existing methods for line extraction, word segmentation and character segmentation. Some of these methods report very good accuracy.

Keywords: OCR, Segmentation, Handwritten text

INTRODUCTION

Handwritten text recognition has been one of the most challenging tasks for decades, and extensive study continues in order to improve accuracy. Handwritten text recognition can be either offline or online. Offline recognition consists of scanning the handwritten form or document and extracting characters from the scanned document image. Online recognition involves automatic conversion of text that is written on a special digitizer or PDA. One of the most important steps in offline text recognition is text segmentation, which has the following three steps:

Line extraction: Text line extraction is the first step in any text segmentation process. It involves extracting the text lines from the scanned image. This is one of the most challenging steps in the text recognition process, since handwritten documents exhibit multiple orientations, overlapping characters, skew, etc.

Word segmentation: After the lines are extracted, the second step is to segment the individual words from the extracted text lines. These extracted words are then used for text recognition. Word segmentation can be done in two ways:

1. Analytical approach: The words are identified by first identifying the characters that make up each word.
2. Holistic approach: This approach treats the word as a single entity and recognizes it based on its features.

Character segmentation: In this step the individual characters are segmented from the extracted words. This step is also challenging, since different writers have different writing styles.

1. LINE SEGMENTATION METHODS

• Line and word segmentation of handwritten documents by G. Louloudis [1]. It proposes a method for both text line and word extraction. Text line extraction is performed using the Hough transform. A post-processing step segments the lines that the Hough transform fails to extract. Word segmentation is performed by classifying gaps as inter-word or intra-word, depending on a comparison of the distances with a threshold. It achieves a detection rate of 90.4% and a recognition accuracy of 90.6%.

• Text line segmentation of handwritten documents using constrained seam carving by Xi Zhang [2]. It proposes a constrained seam carving method that works well for multi-skewed lines. The method extracts text lines by constraining the energy that is passed along the connected components of the same text line. It achieves an accuracy of 98.4% and is tested on Greek, English and Indian document images.

• Handwritten text line extraction based on minimum spanning tree by Fei Yin [3]. It proposes a method based on the construction of a minimum spanning tree. In this technique, the minimum spanning tree is first constructed by clustering. Text lines are then extracted from the tree edges. It achieves an accuracy of 88.4% on Chinese documents.

• Text line segmentation in handwritten documents using the Mumford-Shah model by Xiaojun Du [4]. It proposes a segmentation algorithm known as the Mumford-Shah (MS) model. It is script independent and achieves segmentation by minimizing the MS energy function. Morphing is also used to remove overlaps between neighboring text lines and to connect broken lines. The result does not depend on the number of evolution steps involved.
• Language independent text line extraction using seam carving by Raid Saabni [5]. It proposes an algorithm based on the seam carving approach that is used for content-aware image resizing. The experimental results on Arabic, Chinese and English historical documents show that this approach manages to separate multi

International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181. Published by www.ijert.org. ICESMART-2015 Conference Proceedings, Volume 3, Issue 19, Special Issue 2015.
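The threshold-based word segmentation described for [1] can be illustrated with a small sketch. This is not the authors' implementation. It assumes a text line has already been extracted and binarized into a list of pixel rows (1 for ink, 0 for background), it measures the distance between components simply as the number of blank columns between them, and the names `ink_runs`, `segment_words` and `gap_threshold` are hypothetical. Gaps narrower than the threshold are treated as intra-word, all others as inter-word.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start column, end column), end exclusive


def ink_runs(line_img: List[List[int]]) -> List[Span]:
    """Return the spans of consecutive columns that contain at least one ink pixel."""
    if not line_img:
        return []
    width = len(line_img[0])
    runs: List[Span] = []
    start = None
    for x in range(width):
        has_ink = any(row[x] for row in line_img)
        if has_ink and start is None:
            start = x                      # a new ink run begins
        elif not has_ink and start is not None:
            runs.append((start, x))        # the current ink run ends
            start = None
    if start is not None:
        runs.append((start, width))        # run touches the right border
    return runs


def segment_words(line_img: List[List[int]], gap_threshold: int) -> List[Span]:
    """Group ink runs into words by comparing each gap with a threshold.

    A gap narrower than gap_threshold is treated as intra-word and the run is
    merged into the previous word; any other gap is treated as inter-word.
    """
    words: List[Span] = []
    for start, end in ink_runs(line_img):
        if words and start - words[-1][1] < gap_threshold:
            words[-1] = (words[-1][0], end)  # intra-word gap: extend the word
        else:
            words.append((start, end))       # inter-word gap: start a new word
    return words


if __name__ == "__main__":
    # One-row toy "line image": gaps of width 1 and 3 between ink runs.
    toy_line = [[1, 1, 0, 1, 0, 0, 0, 1, 1, 0]]
    print(segment_words(toy_line, gap_threshold=3))  # [(0, 4), (7, 9)]
```

In practice the threshold would be chosen per document or per line, for example from the distribution of gap widths, because handwriting spacing varies between writers. The fixed value used here is only for illustration.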

[1] C. Halatsis et al. Line and Word Segmentation of Handwritten Documents, 2008.

[2] Sargur N. Srihari et al. Word Image Retrieval Using Binary Features, 2003, IS&T/SPIE Electronic Imaging.

[3] F. Perronnin et al. Local Gradient Histogram Features for Word Spotting in Unconstrained Handwritten Documents, 2008.

[4] David D. Palmer. A Trainable Rule-Based Algorithm for Word Segmentation, 1997.

[5] Sayantan Sarkar. Word Spotting in Cursive Handwritten Documents Using Modified Character Shape Codes, 2012, ACITY.

[6] Jihad El-Sana et al. Language-Independent Text Lines Extraction Using Seam Carving, 2011, 2011 International Conference on Document Analysis and Recognition.

[7] David D. Palmer et al. A Trainable Rule-Based Algorithm for Word Segmentation, 1997, ACL.

[8] Syed Saqib Bukhari et al. Script-Independent Handwritten Textlines Segmentation Using Active Contours, 2009, 2009 10th International Conference on Document Analysis and Recognition.

[9] Fei Yin et al. Handwritten Text Line Extraction Based on Minimum Spanning Tree Clustering, 2007, 2007 International Conference on Wavelet Analysis and Pattern Recognition.

[10] M. Shruthi et al. A Robust Invariant Approach for Word Segmentation of Document Images, 2014.

[11] R. Sarkar et al. Handwritten Word Recognition Using MLP Based Classifier: A Holistic Approach, 2013.

[12] Pietro Perona et al. Using Hierarchical Shape Models to Spot Keywords in Cursive Handwriting Data, 1998, Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231).

[13] Ghazali Sulong et al. Cursive Script Segmentation with Neural Confidence, 2011.

[14] Munish Kumar et al. Segmentation of Isolated and Touching Characters in Offline Handwritten Gurmukhi Script Recognition, 2014.

[15] Basilios Gatos et al. Handwritten Text Line Segmentation by Shredding Text into its Lines, 2009, 2009 10th International Conference on Document Analysis and Recognition.

[16] Michael Blumenstein et al. The Neural-Based Segmentation of Cursive Words Using Enhanced Heuristics, 2005, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).

[17] Laurence Likforman-Sulem et al. A Hough Based Algorithm for Extracting Text Lines in Handwritten Documents, 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[18] Berrin A. Yanikoglu et al. Segmentation of Off-Line Cursive Handwriting Using Linear Programming, 1998, Pattern Recognit.

[20] Jun Zhou et al. Handwritten Text Segmentation Using Average Longest Path Algorithm, 2013, 2013 IEEE Workshop on Applications of Computer Vision (WACV).