Rule-based document structure understanding with a fuzzy combination of layout and textual features

Abstract. Document image processing is a crucial part of office automation. It begins at the OCR phase and then faces the difficulties of document analysis and understanding. This paper presents a hybrid and comprehensive approach to document structure analysis. It is hybrid in the sense that it uses both layout (geometrical) and textual features of a given document. These features form the basis for potential conditions, which in turn are used to express the fuzzy-matched rules of an underlying rule base. A rule can be formulated over features observed within one specific layout object, but rules can also express dependencies between different layout objects. In addition to its rule-driven analysis, which allows easy adaptation to specific domains with their specific logical objects, the system contains domain-independent markup algorithms for common objects (e.g., lists).
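
The idea of fuzzy-matched rules over layout and textual features can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `LayoutObject` fields, the thresholds, the rule names, the weights, and the weighted-average combination are hypothetical and are not taken from the system described in the abstract. Each condition returns a degree of match between 0 and 1, one condition looks at a neighbouring layout object, and a rule's score is the weighted mean of its conditions.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class LayoutObject:
    """Hypothetical layout block: text plus bounding box and font size."""
    text: str
    x: float
    y: float          # top edge, measured downward from the page top
    width: float
    height: float
    font_size: float


# A condition maps (object, all objects on the page) to a degree in [0, 1].
Condition = Callable[[LayoutObject, List[LayoutObject]], float]


def ramp(value: float, low: float, high: float) -> float:
    """Linear membership: 0 at or below `low`, 1 at or above `high`."""
    return max(0.0, min(1.0, (value - low) / (high - low)))


def large_font(obj: LayoutObject, page: List[LayoutObject]) -> float:
    # Layout feature (thresholds are assumed values).
    return ramp(obj.font_size, 10.0, 16.0)


def near_page_top(obj: LayoutObject, page: List[LayoutObject]) -> float:
    # Layout feature: fully true above y=50, fully false below y=300.
    return 1.0 - ramp(obj.y, 50.0, 300.0)


def starts_with_caption_word(obj: LayoutObject, page: List[LayoutObject]) -> float:
    # Textual feature: crisp here, but it could be a fuzzy string match.
    return 1.0 if re.match(r"(Fig\.|Figure|Table)\s*\d+", obj.text) else 0.0


def directly_below_another_block(obj: LayoutObject, page: List[LayoutObject]) -> float:
    # Dependency between layout objects: small vertical gap to a block above.
    best = 0.0
    for other in page:
        if other is obj:
            continue
        gap = obj.y - (other.y + other.height)
        if gap >= 0:
            best = max(best, 1.0 - ramp(gap, 5.0, 40.0))
    return best


# Hypothetical rule base: logical label -> weighted conditions.
RULES: Dict[str, List[Tuple[Condition, float]]] = {
    "title": [(large_font, 0.5), (near_page_top, 0.5)],
    "caption": [(starts_with_caption_word, 0.6), (directly_below_another_block, 0.4)],
}


def rule_score(conditions: List[Tuple[Condition, float]],
               obj: LayoutObject, page: List[LayoutObject]) -> float:
    """Weighted mean of the condition degrees (an assumed combination)."""
    total_weight = sum(weight for _, weight in conditions)
    return sum(weight * cond(obj, page) for cond, weight in conditions) / total_weight


def label(obj: LayoutObject, page: List[LayoutObject],
          threshold: float = 0.5) -> Tuple[str, float]:
    """Return the best-scoring label, or "unknown" below the threshold."""
    scores = {name: rule_score(conds, obj, page) for name, conds in RULES.items()}
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best, scores[best]
    return "unknown", scores[best]


if __name__ == "__main__":
    page = [
        LayoutObject("Rule-based document structure understanding", 50, 40, 500, 30, 18),
        LayoutObject("", 50, 300, 400, 200, 0),
        LayoutObject("Figure 1: System overview", 50, 510, 400, 12, 9),
    ]
    for block in page:
        print(label(block, page), repr(block.text))
```

A weighted mean lets a rule still fire when one condition matches only partially, which is the point of fuzzy matching; a stricter variant could take the minimum of the condition degrees instead.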
