Geometric algorithms and experiments for automated document structuring

We present and analyze algorithms for the automated segmentation and classification of layout structures in electronic documents. The key idea is to use patterns in the distribution of white space in a document to recognize and interpret its components. The segmentation algorithm divides the document into a hierarchy of logical elements; the classification algorithms then label these elements as base text, tables, indented lists, polygonal drawings, or graphs. We present experimental data and discuss an information access application. Our methodology allows the automatic markup of documents (for instance, in the SGML format) and the creation of multilevel indices and browsing tools for electronic libraries.
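To make the white-space idea concrete, the following is a minimal sketch, not the authors' algorithm. It assumes a document already rendered as fixed-width text lines, splits it into blocks at blank lines, and flags a block as table-like when an interior run of character columns is blank in every line. The function names (`split_blocks`, `gutter_columns`, `looks_tabular`) and the `min_gutter` threshold are hypothetical and chosen only for illustration.

```python
from typing import List


def split_blocks(lines: List[str]) -> List[List[str]]:
    """Split lines into blocks separated by one or more blank lines."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        if line.strip():
            current.append(line)
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def gutter_columns(block: List[str]) -> List[int]:
    """Return the character columns that are blank in every line of a non-empty block."""
    rows = [line.rstrip() for line in block]
    width = max(len(row) for row in rows)
    padded = [row.ljust(width) for row in rows]
    return [c for c in range(width) if all(row[c] == " " for row in padded)]


def looks_tabular(block: List[str], min_gutter: int = 2) -> bool:
    """Assumed heuristic: a multi-line block is table-like if it contains an
    interior vertical strip of white space at least min_gutter columns wide."""
    if len(block) < 2:
        return False
    rows = [line.rstrip() for line in block]
    # Ignore the shared left indentation; it is margin, not a column gutter.
    left = min(len(row) - len(row.lstrip()) for row in rows)
    run, prev = 0, None
    for c in gutter_columns(rows):
        if c < left:
            continue
        run = run + 1 if prev is not None and c == prev + 1 else 1
        if run >= min_gutter:
            return True
        prev = c
    return False


lines = [
    "Intro paragraph of base text",
    "that wraps onto a second line.",
    "",
    "Name    Qty",
    "bolt     12",
    "nut       7",
]
blocks = split_blocks(lines)
labels = ["table" if looks_tabular(b) else "base-text" for b in blocks]
print(labels)
```

A hierarchical segmentation in the spirit of the abstract would apply such splits recursively, alternating horizontal and vertical cuts; this sketch performs only one level of each.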
