Binarization-Free Text Line Segmentation for Historical Documents Based on Interest Point Clustering

Segmenting page images into text lines is a crucial pre-processing step for the automated reading of historical documents. Challenges in this open research field include paper or parchment background noise, ink bleed-through, artifacts due to aging, stains, and touching text lines. In this paper, we present a novel binarization-free line segmentation method that is robust to noise and copes with overlapping and touching text lines. First, interest points representing parts of characters are extracted from gray-scale images. Next, word clusters are identified in high-density regions, and touching components such as ascenders and descenders are separated using seam carving. Finally, text lines are generated by concatenating neighboring word clusters, where neighborhood is defined by the prevailing orientation of the words in the document. An experimental evaluation on the Latin manuscript images of the Saint Gall database shows promising results for real-world applications in terms of both accuracy and efficiency.
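
To make the word-clustering step more concrete, the following is a minimal sketch of density-based clustering over 2D interest point coordinates, written in the spirit of DBSCAN. It is an illustration under stated assumptions, not the method as implemented in the paper: the abstract does not name the clustering algorithm, and the function names (`region_query`, `density_cluster`) and the parameters `eps` and `min_pts` are hypothetical choices made for this example. Interest point extraction itself is assumed to have happened already and is represented here by plain `(x, y)` tuples.

```python
from collections import deque
from math import hypot


def region_query(points, idx, eps):
    """Return indices of all points within distance eps of points[idx], itself included."""
    px, py = points[idx]
    return [j for j, (qx, qy) in enumerate(points) if hypot(px - qx, py - qy) <= eps]


def density_cluster(points, eps, min_pts):
    """Assign each point a cluster id (0, 1, ...) or -1 for noise.

    eps and min_pts are illustrative parameters (assumptions), not values
    taken from the paper.
    """
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Too sparse to seed a cluster; may later become a border point.
            labels[i] = NOISE
            continue
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                # Border point: reachable from a dense point but not dense itself.
                labels[j] = cluster_id
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                # Dense point: keep growing the cluster through its neighbors.
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels


# Toy input: two dense groups of points (stand-ins for two words) and one outlier.
toy_points = [(0, 0), (1, 0), (0, 1), (1, 1),
              (10, 0), (11, 0), (10, 1), (11, 1),
              (50, 50)]
toy_labels = density_cluster(toy_points, eps=1.5, min_pts=3)
```

On this toy input the two dense groups receive separate cluster ids and the isolated point is marked as noise, which mirrors the intuition that isolated interest points caused by stains or background texture should not form word clusters. The brute-force neighbor search is quadratic in the number of points; a spatial index would be the natural replacement for full page images.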
