Text line and word segmentation of handwritten documents

In this paper, we present a segmentation methodology of handwritten documents in their distinct entities, namely, text lines and words. Text line segmentation is achieved by applying Hough transform on a subset of the document image connected components. A post-processing step includes the correction of possible false alarms, the detection of text lines that Hough transform failed to create and finally the efficient separation of vertically connected characters using a novel method based on skeletonization. Word segmentation is addressed as a two class problem. The distances between adjacent overlapped components in a text line are calculated using the combination of two distance metrics and each of them is categorized either as an inter- or an intra-word distance in a Gaussian mixture modeling framework. The performance of the proposed methodology is based on a consistent and concrete evaluation methodology that uses suitable performance measures in order to compare the text line segmentation and word segmentation results against the corresponding ground truth annotation. The efficiency of the proposed methodology is demonstrated by experimentation conducted on two different datasets: (a) on the test set of the ICDAR2007 handwriting segmentation competition and (b) on a set of historical handwritten documents.

[1]  Jiang Yong,et al.  Text line extraction from multi-skewed handwritten documents , 2008, 2008 27th Chinese Control Conference.

[2]  Apostolos Antonacopoulos,et al.  ICDAR 2009 Page Segmentation Competition , 2003, 2009 10th International Conference on Document Analysis and Recognition.

[3]  Horst Bunke,et al.  Using Hidden Markov Models as a Tool for Handwritten Text Line Segmentation , 2007 .

[4]  Thierry Paquet,et al.  Text line segmentation in handwritten document using a production system , 2004, Ninth International Workshop on Frontiers in Handwriting Recognition.

[5]  William A. Barrett,et al.  Separating lines of text in free-form handwritten historical documents , 2006, Second International Conference on Document Image Analysis for Libraries (DIAL'06).

[6]  Laurence Likforman-Sulem,et al.  A Hough based algorithm for extracting text lines in handwritten documents , 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[7]  C. Halatsis,et al.  Line And Word Segmentation of Handwritten Documents , 2008 .

[8]  Venu Govindaraju,et al.  Text extraction from gray scale historical document images using adaptive local connectivity map , 2005, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).

[9]  Yi Li,et al.  Detecting Text Lines in Handwritten Documents , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[10]  Elisabetta Bruzzone,et al.  An algorithm for extracting cursive text lines , 1999, Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR '99 (Cat. No.PR00318).

[11]  Venu Govindaraju,et al.  Line separation for complex document images using fuzzy runlength , 2004, First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings..

[12]  Gyeonghwan Kim,et al.  An architecture for handwritten text recognition systems , 1999, International Journal on Document Analysis and Recognition.

[13]  Basilios Gatos,et al.  ICDAR2005 page segmentation competition , 2007, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).

[14]  Stavros J. Perantonis,et al.  A method for combining complementary techniques for document image segmentation , 2009, Pattern Recognit..

[15]  Ihsin T. Phillips,et al.  Empirical Performance Evaluation of Graphics Recognition Systems , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[16]  Friedrich M. Wahl,et al.  Block segmentation and text extraction in mixed text/image documents , 1982, Comput. Graph. Image Process..

[17]  Gyeonghwan Kim,et al.  Handwritten phrase recognition as applied to street name images , 1998, Pattern Recognit..

[18]  Giovanni Seni,et al.  External word segmentation of off-line handwritten text lines , 1994, Pattern Recognit..

[19]  Aurélie Lemaitre,et al.  Text line extraction in handwritten document with Kalman filter applied on low resolution image , 2006, Second International Conference on Document Image Analysis for Libraries (DIAL'06).

[20]  Basilios Gatos,et al.  Handwriting Segmentation Contest , 2007, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007).

[21]  R. Manmatha,et al.  A scale space approach for automatically segmenting words from historical handwritten documents , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Laurence Likforman-Sulem,et al.  Text line segmentation of historical documents: a survey , 2007, International Journal of Document Analysis and Recognition (IJDAR).

[23]  C. Weliwitage,et al.  Handwritten Document Offline Text Line Segmentation , 2005, Digital Image Computing: Techniques and Applications (DICTA'05).

[24]  Ching Y. Suen,et al.  Word segmentation in handwritten Korean text lines based on gap clustering techniques , 2001, Proceedings of Sixth International Conference on Document Analysis and Recognition.

[25]  Tien D. Bui,et al.  Text line segmentation in handwritten documents using Mumford-Shah model , 2009, Pattern Recognit..

[26]  Horst Bunke,et al.  Text line segmentation and word recognition in a system for general writer independent handwriting recognition , 2001, Proceedings of Sixth International Conference on Document Analysis and Recognition.

[27]  Umapada Pal,et al.  Morphology Based Handwritten Line Segmentation Using Foreground and Background Information , 2008 .

[28]  Vassilis Katsouros,et al.  Robust text-line and word segmentation for handwritten documents images , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[29]  Georgios Louloudis,et al.  ICDAR 2009 Handwriting Segmentation Contest , 2009, ICDAR.

[30]  Uma Mahadevan,et al.  Gap metrics for word separation in handwritten lines , 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[31]  Chun-Jen Chen,et al.  A linear-time component-labeling algorithm using contour tracing technique , 2004, Comput. Vis. Image Underst..

[32]  Horst Bunke,et al.  Tree structure for word extraction from handwritten text lines , 2005, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).

[33]  Zhixin Shi,et al.  A natural learning algorithm based on Hough transform for text lines extraction in handwritten documents , 1999 .

[34]  Jean-Michel Marin,et al.  Bayesian Modelling and Inference on Mixtures of Distributions , 2005 .

[35]  Fei Yin,et al.  Handwritten text line extraction based on minimum spanning tree clustering , 2007, 2007 International Conference on Wavelet Analysis and Pattern Recognition.

[36]  Pinar Duygulu Sahin,et al.  Retrieval of Ottoman documents , 2006, MIR '06.

[37]  Apostolos Antonacopoulos,et al.  Handwriting Segmentation Contest , 2007, ICDAR.

[38]  Sargur N. Srihari,et al.  Word segmentation of off-line handwritten documents , 2008, Electronic Imaging.

[39]  Ioannis Pratikakis,et al.  A segmentation-free approach for keyword search in historical typewritten documents , 2005, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).

[40]  Ioannis Pratikakis,et al.  A Block-Based Hough Transform Mapping for Text Line Detection in Handwritten Documents , 2006 .

[41]  Juergen Luettin,et al.  A new normalization technique for cursive handwritten words , 2001, Pattern Recognit. Lett..

[42]  Sargur N. Srihari,et al.  A statistical approach to line segmentation in handwritten documents , 2007, Electronic Imaging.