Paper information - Overlapping and multi-touching text-line segmentation by Block Covering analysis

Overlapping and multi-touching text-line segmentation by Block Covering analysis

This paper presents a new approach to text-line segmentation based on Block Covering, which addresses the problem of overlapping and multi-touching components. Block Covering is the core of a system that processes a set of ancient Arabic documents from historical archives and is designed to separate text-lines even when they are overlapping and multi-touching. We exploit the Block Covering technique in three steps: a new fractal analysis (Block Counting) for document classification, a statistical analysis of block heights for block classification, and a neighboring analysis for building text-lines. The Block Counting fractal analysis, combined with a fuzzy C-means scheme, is performed on document images to classify them by complexity as either tightly (closely) spaced documents (TSD) or widely spaced documents (WSD). An optimal Block Covering is applied to TSD, which include overlapping and multi-touching lines. The large blocks generated by the covering are then segmented using the statistical analysis of block heights. The final labeling into text-lines is based on a block neighboring analysis. Experimental results on images from the Tunisian Historical Archives demonstrate the feasibility of the Block Covering technique for segmenting ancient Arabic documents.
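The document-classification step described in the abstract can be illustrated with a rough sketch. The abstract does not specify how Block Counting is computed, so the code below substitutes the classical box-counting estimate of fractal dimension, followed by a textbook two-cluster fuzzy C-means on the resulting scalar values. The function names, the box sizes, and the use of a single scalar feature per document are all assumptions made for illustration and are not the authors' method.

```python
import math


def box_count(image, size):
    """Count size x size boxes that contain at least one foreground pixel.

    `image` is a list of rows; any truthy value is treated as ink.
    """
    rows, cols = len(image), len(image[0])
    count = 0
    for top in range(0, rows, size):
        for left in range(0, cols, size):
            if any(image[r][c]
                   for r in range(top, min(top + size, rows))
                   for c in range(left, min(left + size, cols))):
                count += 1
    return count


def box_counting_dimension(image, sizes=(1, 2, 4, 8)):
    """Classical box-counting estimate: slope of log N(s) against log(1/s).

    The box sizes are an assumed default, not taken from the paper.
    """
    counts = [box_count(image, s) for s in sizes]
    if min(counts) == 0:
        raise ValueError("image has no foreground pixels")
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(n) for n in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def fuzzy_c_means_1d(values, m=2.0, iterations=50):
    """Textbook fuzzy C-means with two clusters on scalar features.

    Returns (centers, memberships); memberships[k][i] is the degree to
    which values[k] belongs to cluster i. Centers start at min and max.
    """
    centers = [min(values), max(values)]
    power = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        memberships = []
        for x in values:
            dists = [abs(x - c) for c in centers]
            if 0.0 in dists:
                # A point sitting exactly on a center belongs fully to it.
                row = [1.0 if d == 0.0 else 0.0 for d in dists]
            else:
                row = [d ** -power for d in dists]
            total = sum(row)
            memberships.append([r / total for r in row])
        # Update each center as the membership-weighted mean of the values.
        for i in range(len(centers)):
            weights = [u[i] ** m for u in memberships]
            centers[i] = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return centers, memberships
```

In a pipeline along these lines, each document image would be reduced to one value by `box_counting_dimension`, and the two clusters returned by `fuzzy_c_means_1d` would stand in for the TSD and WSD classes. Which cluster corresponds to which class, and whether a single scalar is a sufficient feature, would need to be checked against the paper itself.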

Laurence Likforman-Sulem | Abderrazak Zahour | Bruno Taconet | Wafa Boussellaa
