Text Line Detection for Heterogeneous Documents

Text line detection is a pre-processing step for automated document analysis tasks such as word spotting or OCR. It is also used for document structure analysis and layout analysis. For mixed layouts, degraded documents, and handwritten documents, text line detection remains challenging. We present a novel approach that targets torn documents with varying layouts and writing. The proposed method is a bottom-up approach that fuses words so as to globally minimize their fusing distance. To improve processing time and support further layout analysis, text lines are represented by oriented rectangles. Even though the method was designed for modern handwritten and printed documents, tests on medieval manuscripts give promising results. Additionally, the text line detection was evaluated on the ICDAR 2009 and ICFHR 2010 Handwriting Segmentation Contest datasets.
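The bottom-up idea of fusing words into lines and describing each line by an oriented rectangle can be illustrated with a minimal sketch. The abstract does not define the fusing distance or the optimization used, so everything here is an assumption made for illustration: words are axis-aligned boxes (x0, y0, x1, y1), the hypothetical `fusing_distance` combines the horizontal gap with a weighted vertical offset, and `fuse_words` merges greedily (closest pair first) rather than minimizing the distance globally as the paper's method does.

```python
import math
from itertools import combinations


def bounds(group):
    """Axis-aligned bounding box of a group of word boxes (x0, y0, x1, y1)."""
    return (min(b[0] for b in group), min(b[1] for b in group),
            max(b[2] for b in group), max(b[3] for b in group))


def fusing_distance(a, b, vertical_weight=3.0):
    """Hypothetical distance: horizontal gap plus weighted vertical offset."""
    ax0, ay0, ax1, ay1 = bounds(a)
    bx0, by0, bx1, by1 = bounds(b)
    gap = max(0.0, max(ax0, bx0) - min(ax1, bx1))  # 0 if the groups overlap in x
    offset = abs((ay0 + ay1) / 2 - (by0 + by1) / 2)
    return gap + vertical_weight * offset


def fuse_words(words, max_distance=10.0):
    """Greedy stand-in for the fusion step: merge the closest pair of groups
    until no pair is closer than max_distance (an assumed threshold)."""
    groups = [[w] for w in words]
    while len(groups) > 1:
        (i, j), d = min(
            (((i, j), fusing_distance(groups[i], groups[j]))
             for i, j in combinations(range(len(groups)), 2)),
            key=lambda item: item[1],
        )
        if d > max_distance:
            break
        merged = groups.pop(j)  # j > i, so index i stays valid
        groups[i].extend(merged)
    return groups


def oriented_rectangle(group):
    """Return (center, width, height, angle) of a rectangle aligned with the
    least-squares line through the word centers."""
    centers = [((x0 + x1) / 2, (y0 + y1) / 2) for x0, y0, x1, y1 in group]
    mx = sum(c[0] for c in centers) / len(centers)
    my = sum(c[1] for c in centers) / len(centers)
    sxx = sum((c[0] - mx) ** 2 for c in centers)
    sxy = sum((c[0] - mx) * (c[1] - my) for c in centers)
    angle = math.atan2(sxy, sxx) if sxx > 0 else 0.0
    ca, sa = math.cos(angle), math.sin(angle)
    corners = [(x, y) for x0, y0, x1, y1 in group
               for x in (x0, x1) for y in (y0, y1)]
    u = [x * ca + y * sa for x, y in corners]   # coordinate along the line
    v = [-x * sa + y * ca for x, y in corners]  # coordinate across the line
    uc, vc = (min(u) + max(u)) / 2, (min(v) + max(v)) / 2
    center = (uc * ca - vc * sa, uc * sa + vc * ca)
    return center, max(u) - min(u), max(v) - min(v), angle


words = [(0, 0, 10, 5), (12, 0, 22, 5), (24, 1, 34, 6),
         (0, 20, 10, 25), (13, 20, 25, 25)]
lines = fuse_words(words)
rectangles = [oriented_rectangle(line) for line in lines]
```

Storing each line as a center, two side lengths, and an angle keeps later layout steps cheap, since comparing lines reduces to a few arithmetic operations instead of pixel-level processing.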
