Word and Sentence Extraction Using Irregular Pyramid

This paper presents the results of our continued work to enhance our previously proposed algorithm. Moving beyond the extraction of word groups, and building on the same irregular pyramid structure, the new algorithm groups the extracted words into sentences. The algorithm is distinctive in its ability to process text that varies widely in size, font, orientation, and layout within the same document image. No assumption is made about the document type. The algorithm applies four fundamental concepts on the irregular pyramid structure. The first is the inclusion of background information. The second is closeness: text elements within a group are closer to one another, in spatial distance, than they are to other text areas. The third is the "majority win" strategy, which is better suited to a greatly varying environment than a constant threshold value. The final concept is uniformity and continuity among words belonging to the same sentence.
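
As a rough illustration of two of these concepts, the sketch below groups points by closeness without a fixed distance threshold and picks a group attribute by "majority win". It is not the irregular pyramid algorithm itself: the flat nearest-neighbour linking, the function names `group_by_closeness` and `majority_win`, and the sample coordinates and labels are all assumptions made for this example.

```python
from collections import Counter
from math import hypot


def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i], excluding i itself."""
    xi, yi = points[i]
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: hypot(xi - points[j][0], yi - points[j][1]),
    )


def group_by_closeness(points):
    """Link every point to its nearest neighbour and return the linked groups.

    Hypothetical stand-in for the closeness concept: grouping depends only on
    relative distances, so no constant threshold is needed.
    """
    if len(points) < 2:
        return [[i] for i in range(len(points))]

    parent = list(range(len(points)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(points)):
        parent[find(i)] = find(nearest_neighbor(points, i))

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def majority_win(labels):
    """Return the most frequent label among the members of a group."""
    return Counter(labels).most_common(1)[0][0]


# Assumed centroids of text components: one tight cluster, one distant pair.
centroids = [(0, 0), (1, 0), (2.2, 0), (10, 10), (11, 10)]
groups = group_by_closeness(centroids)

# Assumed per-component orientation estimates for the first group.
orientation = majority_win(["horizontal", "horizontal", "vertical"])
```

Because each point is linked only to whichever neighbour is nearest, the same code handles tightly and loosely spaced text alike, which is the motivation the abstract gives for avoiding a constant threshold.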
