Text characterization by connected component transformations

Worldwide there are many different scripts and languages in common use. Finding text lines and, where present, character and word boundaries is a necessary primitive operation for most document processing applications. We have developed a method of handling text lines from several different languages that is robust in the presence of common printing and scanning artifacts. We describe a technique that derives the characteristics of a text line from the list of connected pixel components that make up the image. The technique applies across many languages and scripts that are laid out horizontally. For text set in Roman type, the location and dimensions of each text line are augmented with the positions of the baseline and x-height. Where appropriate, the coordinates of space-delimited words and individual character cells are also determined. The technique incorporates a computationally inexpensive method for straightening curved lines and segmenting kerned characters, and a novel method, based on font weight and stress, for locating the boundaries of individual characters even when their images touch.
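To make the idea of deriving line characteristics from connected components concrete, the following minimal sketch groups component bounding boxes into horizontal text lines and then estimates a baseline and x-height for each line. It is an illustration only, not the method of the paper: the `Box` layout, the function names `group_into_lines` and `baseline_and_x_height`, and the use of simple majority voting on box edges are all assumptions made for this example. It also assumes the bounding boxes have already been extracted and that lines are straight and unskewed.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical box layout: (left, top, right, bottom), with y growing downward.
Box = Tuple[int, int, int, int]


def group_into_lines(boxes: List[Box]) -> List[List[Box]]:
    """Group component boxes into text lines by overlap of vertical extents."""
    lines = []  # each entry is [line_top, line_bottom, member_boxes]
    for box in sorted(boxes, key=lambda b: b[1]):
        _, top, _, bottom = box
        for line in lines:
            if top <= line[1] and bottom >= line[0]:  # vertical overlap
                line[0] = min(line[0], top)
                line[1] = max(line[1], bottom)
                line[2].append(box)
                break
        else:
            lines.append([top, bottom, [box]])
    # Return each line's boxes in left-to-right reading order.
    return [sorted(members, key=lambda b: b[0]) for _, _, members in lines]


def baseline_and_x_height(line: List[Box]) -> Tuple[int, int]:
    """Estimate (baseline_y, x_height) by majority vote on box edges.

    Assumption: most components sit on the baseline, and most of those
    have no ascender, so the commonest top edge marks the x-height line.
    """
    baseline = Counter(b[3] for b in line).most_common(1)[0][0]
    tops = Counter(b[1] for b in line if b[3] == baseline)
    x_top = tops.most_common(1)[0][0]
    return baseline, baseline - x_top


# Synthetic boxes for two lines; one ascender and one descender in the first.
boxes = [
    (0, 12, 5, 20), (7, 12, 12, 20), (14, 6, 19, 20),
    (21, 12, 26, 24), (28, 12, 33, 20),
    (0, 42, 5, 50), (7, 42, 12, 50), (14, 36, 19, 50),
]
lines = group_into_lines(boxes)
metrics = [baseline_and_x_height(line) for line in lines]
```

Voting on edge positions is a deliberately crude choice: it tolerates a few ascenders, descenders, and stray marks, but it would need a tolerance band or per-segment fitting to cope with the curved lines and scanning artifacts mentioned in the abstract.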
