Document Structure Analysis Based on Layout and Textual Features

Document image processing is a crucial part of office automation. It begins with the OCR phase, followed by the difficult tasks of document "analysis" and "understanding". This paper presents a hybrid and comprehensive approach to document structure analysis. It is hybrid in the sense that it uses layout (geometrical) as well as textual features of a given document. These features form the basis for potential conditions, which in turn are used to express the fuzzy-matched rules of an underlying rule base.
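
To make the idea of rules built from layout and textual conditions more concrete, here is a minimal sketch in Python. It is not the paper's implementation: the `Block` fields, the condition functions, the thresholds, and the two example rules are all hypothetical, the use of `min` as the fuzzy conjunction is an assumption, and `difflib.SequenceMatcher` stands in for whatever fuzzy string comparison the actual system uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple


@dataclass
class Block:
    """A text block found on a page (hypothetical feature set)."""
    text: str
    top: float        # relative vertical position: 0.0 = top of page, 1.0 = bottom
    font_size: float  # in points


# A condition maps a block to a degree of fulfilment in [0, 1].
Condition = Callable[[Block], float]


def near_top(block: Block) -> float:
    """Layout condition: 1.0 at the top edge, falling to 0.0 at mid-page."""
    return max(0.0, 1.0 - block.top / 0.5)


def large_font(block: Block) -> float:
    """Layout condition: ramps from 0.0 at 10 pt to 1.0 at 20 pt (assumed thresholds)."""
    return min(1.0, max(0.0, (block.font_size - 10.0) / 10.0))


def starts_like(keyword: str) -> Condition:
    """Textual condition: similarity of the block's prefix to a keyword.

    Tolerates OCR errors such as 'Abstrakt' for 'Abstract'.
    """
    def condition(block: Block) -> float:
        prefix = block.text[:len(keyword)].lower()
        return SequenceMatcher(None, prefix, keyword.lower()).ratio()
    return condition


# Hypothetical rule base: each logical label is a conjunction of conditions.
RULES: Dict[str, List[Condition]] = {
    "title": [near_top, large_font],
    "abstract": [starts_like("Abstract")],
}


def match(block: Block, rules: Dict[str, List[Condition]] = RULES) -> Tuple[str, float]:
    """Return the best-matching label and its degree of match.

    A rule's degree is the minimum over its conditions (fuzzy AND, an assumption).
    """
    scores = {label: min(c(block) for c in conds) for label, conds in rules.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    print(match(Block("Document Structure Analysis", top=0.05, font_size=18.0)))
    print(match(Block("Abstrakt: This paper presents ...", top=0.30, font_size=10.0)))
```

Because every condition returns a degree rather than a yes/no answer, a block with a slightly misrecognized keyword or a slightly smaller font can still match a rule, only with a lower score.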
