Segmentation of Handwritten Document Images into Text Lines

Many governmental, cultural, commercial and educational organizations manage large volumes of handwritten textual information. Because managing information recorded on paper or in scanned documents is hard and time-consuming, Document Image Analysis (DIA) aims to extract the intended information as a human would (Nagy, 2000). The main subtasks of DIA (Mao et al., 2003) are: i) document layout analysis, which locates the “physical” components of the document such as columns, paragraphs, text lines, words, tables and figures; ii) document content analysis, which labels these components as titles, legends, footnotes, etc.; iii) optical character recognition (OCR); and iv) reconstruction of the corresponding electronic document. The algorithms proposed for these processing stages come mainly from the fields of image processing, computer vision, machine learning and pattern recognition. Some of these algorithms are very effective on machine-printed document images and have therefore been incorporated into the workflows of well-known OCR systems. In contrast, no comparably efficient systems have been developed for handwritten documents. The main reason is that the layout of a handwritten manuscript and its writing style depend solely on the author's choices. For example, text lines in a machine-printed document can be assumed to share the same skew, whereas handwritten text lines may be curvilinear. Text line segmentation is a critical stage of layout analysis, on which further tasks build, such as word segmentation, grouping of text lines into paragraphs, and characterization of text lines as titles, headings, footnotes, etc.
For instance, text-line segmentation is a step in the pipeline of the Handwritten Address Interpretation System (HWAIS), which takes a postal address image and determines a unique delivery point (Cohen et al., 1994). Another application in which text line extraction serves as a preprocessing step is the indexing of the George Washington papers at the Library of Congress, as detailed by Manmatha & Rothfeder (2005). A similar document analysis project, the Bovary Project, includes a text-line segmentation stage in the transcription of the manuscripts of Gustave Flaubert (Nicolas et al., 2004a). In addition, many recent archive digitisation projects include document image understanding activities, namely the automatic or semi-automatic extraction and indexing of metadata such as titles, subtitles, keywords, etc. (Antonacopoulos & Karatzas, 2004; Tomai et al., 2002). These activities also rely on text-line extraction.

[1]  Pierre Soille,et al.  Morphological Image Analysis: Principles and Applications , 2003 .

[2]  Lawrence O'Gorman,et al.  The Document Spectrum for Page Layout Analysis , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[3]  George Nagy,et al.  Twenty Years of Document Image Analysis in PAMI , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[4]  Azriel Rosenfeld,et al.  Document structure analysis algorithms: a literature survey , 2003, IS&T/SPIE Electronic Imaging.

[5]  Syed Saqib Bukhari,et al.  Script-Independent Handwritten Textlines Segmentation Using Active Contours , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[6]  N. Otsu A threshold selection method from gray level histograms , 1979 .

[7]  Sargur N. Srihari,et al.  Robust line segmentation for handwritten documents , 2008, Electronic Imaging.

[8]  Wayne Niblack,et al.  An introduction to digital image processing , 1986 .

[9]  Matti Pietikäinen,et al.  Adaptive document image binarization , 2000, Pattern Recognit..

[10]  Linda G. Shapiro,et al.  Computer and Robot Vision , 1991 .

[11]  Horst Bunke,et al.  The IAM-database: an English sentence database for offline handwriting recognition , 2002, International Journal on Document Analysis and Recognition.

[12]  Thierry Paquet,et al.  Text line segmentation in handwritten document using a production system , 2004, Ninth International Workshop on Frontiers in Handwriting Recognition.

[13]  Venu Govindaraju,et al.  Line separation for complex document images using fuzzy runlength , 2004, First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings..

[14]  Carlos Guedes,et al.  A connected path approach for staff detection on a music score , 2008, 2008 15th IEEE International Conference on Image Processing.

[15]  Dan S. Bloomberg Textured reductions for document image analysis , 1996, Electronic Imaging.

[16]  Fei Yin,et al.  Handwritten Chinese text line segmentation by clustering with distance metric learning , 2009, Pattern Recognit..

[17]  Subhadip Basu,et al.  Text Line Segmentation for Unconstrained Handwritten Document Images Using Neighborhood Connected Component Analysis , 2009, PReMI.

[18]  R. Manmatha,et al.  A scale space approach for automatically segmenting words from historical handwritten documents , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Laurence Likforman-Sulem,et al.  Text line segmentation of historical documents: a survey , 2007, International Journal on Document Analysis and Recognition (IJDAR).

[20]  C. Weliwitage,et al.  Handwritten Document Offline Text Line Segmentation , 2005, Digital Image Computing: Techniques and Applications (DICTA'05).

[21]  Ching Y. Suen,et al.  Thinning Methodologies - A Comprehensive Survey , 1992, IEEE Trans. Pattern Anal. Mach. Intell..

[22]  Vassilis Katsouros,et al.  Handwritten document image segmentation into text lines and words , 2010, Pattern Recognit..

[23]  Georgios Louloudis,et al.  ICDAR 2009 Handwriting Segmentation Contest , 2009, ICDAR.

[24]  Jonathan J. Hull Document Image Skew Detection: Survey and Annotated Bibliography , 1996, DAS.

[25]  Berrin A. Yanikoglu,et al.  Segmentation of off-line cursive handwriting using linear programming , 1998, Pattern Recognit..

[26]  George D. C. Cavalcanti,et al.  Text Line Segmentation Based on Morphology and Histogram Projection , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[27]  Tien D. Bui,et al.  Text line segmentation in handwritten documents using Mumford-Shah model , 2009, Pattern Recognit..

[28]  Apostolos Antonacopoulos,et al.  Document image analysis for World War II personal records , 2004, First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings..

[29]  Yi Li,et al.  Script-Independent Text Line Segmentation in Freestyle Handwritten Documents , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Sargur N. Srihari,et al.  On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[31]  Ioannis Pratikakis,et al.  Text line detection in handwritten documents , 2008, Pattern Recognit..

[32]  Richard O. Duda,et al.  Use of the Hough transformation to detect lines and curves in pictures , 1972, CACM.

[33]  Nikos Fakotakis,et al.  An Integrated System for Handwritten Document Image Processing , 2003, Int. J. Pattern Recognit. Artif. Intell..

[34]  Klaus D. Tönnies,et al.  Line detection and segmentation in historical church registers , 2001, Proceedings of Sixth International Conference on Document Analysis and Recognition.

[35]  Jan-Olof Eklundh,et al.  Scale-space primal sketch: construction and experiments , 1992, Image Vis. Comput..

[36]  Basilios Gatos,et al.  Handwriting Segmentation Contest , 2007, ICDAR.

[37]  Luc Vincent,et al.  Watersheds in Digital Spaces: An Efficient Algorithm Based on Immersion Simulations , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[38]  Ronald Fedkiw,et al.  Level set methods and dynamic implicit surfaces , 2002, Applied mathematical sciences.

[39]  Bin Zhang,et al.  Transcript mapping for historic handwritten document images , 2002, Proceedings Eighth International Workshop on Frontiers in Handwriting Recognition.

[40]  Sargur N. Srihari,et al.  A statistical approach to line segmentation in handwritten documents , 2007, Electronic Imaging.

[41]  Basilios Gatos,et al.  Handwritten Text Line Segmentation by Shredding Text into its Lines , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[42]  Sargur N. Srihari,et al.  Control Structure for Interpreting Handwritten Addresses , 1994, IEEE Trans. Pattern Anal. Mach. Intell..

[43]  Thierry Paquet,et al.  Enriching Historical Manuscripts: The Bovary Project , 2004, Document Analysis Systems.

[44]  Fei Yin,et al.  A Variational Bayes Method for Handwritten Text Line Segmentation , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[45]  A. Peter Johnson,et al.  A Fast Algorithm for Bottom-Up Document Layout Analysis , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[46]  Noorzaily Mohamed Noor Off-line Handwriting Text Line Segmentation : A Review , 2008 .

[47]  Venu Govindaraju,et al.  A Steerable Directional Local Profile Technique for Extraction of Handwritten Arabic Text Lines , 2009, 2009 10th International Conference on Document Analysis and Recognition.