A Fast Algorithm for Bottom-Up Document Layout Analysis

This paper describes a new bottom-up method for document layout analysis. The algorithm was implemented in the CLiDE (Chemical Literature Data Extraction) system, but the method described here is suitable for a broader range of documents. It is based on Kruskal's algorithm and uses a special distance metric between components to construct the physical page structure. The method retains the major advantages of bottom-up systems: it is independent of text spacing and of block alignment. The algorithm's computational complexity is reduced to linear by using heuristics and path compression.
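The general idea of Kruskal-style merging with path compression can be illustrated with a minimal sketch. It is not the CLiDE implementation: the box representation, the `box_gap` distance, and the `max_gap` threshold are all assumptions made for illustration, and the paper's special distance metric and heuristics are not reproduced. In particular, this sketch sorts all component pairs and is therefore not linear-time.

```python
from itertools import combinations

# Hypothetical representation: a component is its bounding box (x0, y0, x1, y1).
Box = tuple[float, float, float, float]


def box_gap(a: Box, b: Box) -> float:
    """Assumed distance: Euclidean gap between two boxes (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5


def find(parent: list[int], i: int) -> int:
    """Return the root of i's set, compressing the path along the way."""
    root = i
    while parent[root] != root:
        root = parent[root]
    while parent[i] != root:
        parent[i], i = root, parent[i]
    return root


def cluster_boxes(boxes: list[Box], max_gap: float) -> list[list[int]]:
    """Merge components in order of increasing distance, as Kruskal's
    algorithm does, stopping at the assumed threshold max_gap."""
    parent = list(range(len(boxes)))
    # Naive all-pairs edge list: quadratic, unlike the heuristics in the paper.
    edges = sorted(
        (box_gap(boxes[i], boxes[j]), i, j)
        for i, j in combinations(range(len(boxes)), 2)
    )
    for dist, i, j in edges:
        if dist > max_gap:
            break
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[rj] = ri  # union: join the two blocks
    groups: dict[int, list[int]] = {}
    for i in range(len(boxes)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    page = [(0, 0, 10, 10), (12, 0, 22, 10), (100, 0, 110, 10), (113, 0, 123, 10)]
    print(cluster_boxes(page, max_gap=5.0))
```

With the four sample boxes, the first two and the last two lie within 5 units of each other, so they end up in two separate groups. Stopping the merge at a threshold is one simple way to turn the spanning-tree construction into a block segmentation; other stopping rules are equally plausible.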
