A tool for classifying office documents

The authors present the design of a tool for classifying office documents. A document's layout structure is represented as an ordered labeled tree, called the layout structure tree (L-S-tree), based on a nested segmentation procedure. The tool takes a sample-based approach to learning: concepts are learned by retaining samples, and a new document is classified by matching its L-S-tree against those samples. The matching process involves both computing the edit distance between two trees, using a previously developed pattern matching toolkit, and calculating the degree of conceptual closeness between the document and the samples. The experimental results show that the tool can classify various types of office documents, even with very few samples in the sample base.
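
The sample-based matching idea can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in a simplified top-down (Selkow-style) tree edit distance with unit costs in place of the pattern matching toolkit mentioned above, and it leaves out the conceptual-closeness calculation entirely, since the abstract does not say how that is computed. The `Node` class, the block labels ("page", "header", and so on), and the `classify` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One node of an ordered labeled tree (a stand-in for an L-S-tree)."""
    label: str
    children: tuple = ()


def size(tree):
    """Number of nodes in the tree; used as the cost of inserting or deleting it."""
    return 1 + sum(size(child) for child in tree.children)


def tree_distance(a, b):
    """Simplified top-down edit distance between two ordered labeled trees.

    Assumption: unit cost per relabeled, inserted, or deleted node, and
    subtrees are inserted or deleted as a whole (Selkow-style restriction).
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    # d[i][j]: cost of aligning the first i children of a with the first j of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete a's subtree
                d[i][j - 1] + size(b.children[j - 1]),   # insert b's subtree
                d[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


def classify(document, sample_base):
    """Return the concept of the retained sample nearest to the document."""
    concept, _ = min(sample_base, key=lambda s: tree_distance(document, s[1]))
    return concept


# Hypothetical sample base with one retained sample per concept.
letter = Node("page", (
    Node("header"),
    Node("body", (Node("para"), Node("para"))),
    Node("signature"),
))
memo = Node("page", (
    Node("header", (Node("to"), Node("from"), Node("subject"))),
    Node("body", (Node("para"),)),
))
sample_base = [("letter", letter), ("memo", memo)]

new_document = Node("page", (
    Node("header"),
    Node("body", (Node("para"),)),
    Node("signature"),
))
predicted = classify(new_document, sample_base)
```

With these toy trees the new document is one edit from the letter sample and four edits from the memo sample, so `predicted` is `"letter"`. A nearest-sample rule like this works with a single sample per concept, which is in the spirit of the abstract's remark about small sample bases, although the tool described here also weighs conceptual closeness.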
