Overlapping and multi-touching text-line segmentation by Block Covering analysis

This paper presents a new approach to text-line segmentation, based on Block Covering, that addresses the problem of overlapping and multi-touching components. Block Covering is the core of a system that processes a set of ancient Arabic documents from historical archives and is designed to separate text-lines even when they overlap or touch at multiple points. We apply the Block Covering technique in three steps: a new fractal analysis (Block Counting) for document classification, a statistical analysis of block heights for block classification, and a neighborhood analysis for building text-lines. The Block Counting fractal analysis, combined with a fuzzy C-means scheme, is performed on document images to classify them by complexity as either tightly spaced documents (TSD) or widely spaced documents (WSD). An optimal Block Covering is then applied to TSD documents, which contain overlapping and multi-touching lines. The large blocks generated by the covering are segmented using the statistical analysis of block heights, and the final labeling into text-lines is based on an analysis of neighboring blocks. Experimental results on images from the Tunisian Historical Archives demonstrate the feasibility of the Block Covering technique for segmenting ancient Arabic documents.
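
The abstract does not define the Block Counting analysis in detail. As a rough illustration of the general idea behind fractal measures of this kind, the sketch below uses the standard box-counting estimate as a stand-in: it counts the blocks that contain ink at several block sizes and fits the slope of log(count) against log(1/size). The function names, the block sizes, and the binary list-of-lists image format are assumptions made for this example, not the authors' implementation, and the fuzzy C-means step that would separate TSD from WSD documents is not shown.

```python
import math


def count_blocks(image, size):
    """Count the size x size blocks that contain at least one ink pixel.

    `image` is a binary image given as a list of rows, with 1 for ink
    and 0 for background (an assumed format for this sketch).
    """
    rows, cols = len(image), len(image[0])
    occupied = 0
    for top in range(0, rows, size):
        for left in range(0, cols, size):
            if any(
                image[r][c]
                for r in range(top, min(top + size, rows))
                for c in range(left, min(left + size, cols))
            ):
                occupied += 1
    return occupied


def block_counting_dimension(image, sizes=(1, 2, 4, 8)):
    """Estimate a box-counting dimension by a least-squares line fit.

    The slope of log(count) versus log(1 / size) is returned.
    The block sizes are hypothetical defaults chosen for small test images.
    """
    counts = [count_blocks(image, s) for s in sizes]
    if min(counts) == 0:
        raise ValueError("image contains no ink pixels")
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(n) for n in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


if __name__ == "__main__":
    # A fully inked 16 x 16 page and a page holding a single inked row.
    full_page = [[1] * 16 for _ in range(16)]
    single_line = [[1] * 16 if r == 0 else [0] * 16 for r in range(16)]
    print(block_counting_dimension(full_page))
    print(block_counting_dimension(single_line))
```

On these two toy inputs the estimate is close to 2 for the fully inked page and close to 1 for the single row, which matches the intuition that denser, more tightly spaced writing fills the plane more completely. In a pipeline like the one described, a value of this kind computed per document could serve as the feature that a clustering scheme groups into two classes.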
