A document classification and extraction system with learning ability

In document image processing, the difficulty of automatic document analysis and understanding begins at the OCR phase. Most existing systems perform well only in their specific application domains. In this paper, we describe a domain-independent automatic document image understanding system with learning ability. A segmentation method based on "logical closeness" is proposed. A novel and natural representation of document layout structure, a directed weight graph (DWG), is described. To classify a given document, a string representation matching algorithm is applied first, instead of comparing the document against all the sample graphs. A frame template and a document type hierarchy (DTH) are used to represent the document's logical structure and the hierarchical relationships among these frame templates, respectively. Two learning methodologies are applied: learning from experience and an enhanced perceptron learning algorithm.
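
The idea of matching string representations before comparing graphs can be illustrated with a minimal sketch. The abstract does not specify the string encoding or the matching algorithm, so everything here is an assumption made for illustration: each layout is encoded as a string with one character per block type, Levenshtein edit distance stands in for the matching step, and the names `edit_distance` and `shortlist` are hypothetical. Only the shortlisted samples would then go on to the more expensive graph comparison.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two layout strings (assumed matching metric)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def shortlist(query: str, samples: dict, max_distance: int) -> list:
    """Return names of sample layouts whose strings lie within max_distance
    of the query string, closest first (hypothetical prefilter step)."""
    scored = [(edit_distance(query, s), name) for name, s in samples.items()]
    return [name for dist, name in sorted(scored) if dist <= max_distance]


# Hypothetical encoding: H = heading block, T = text block, P = picture, S = signature.
samples = {"memo": "HTTP", "letter": "HTPS", "form": "TTTTTT"}
candidates = shortlist("HTTP", samples, max_distance=2)
```

With these made-up sample strings, `candidates` keeps `memo` and `letter` and drops `form`, so a graph comparison would run on two samples rather than three. The threshold is a tuning choice and is not taken from the paper.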
