Structure recognition of various kinds of table-form documents

Document structure recognition consists of discriminating the layout structure of a document, i.e., its two-dimensional configuration and format, and identifying the individual item data. Most studies of this kind so far, however, rely on a paradigm in which information about the document structure is defined beforehand for a particular type of document and used as a knowledge base. Such a paradigm succeeds in recognizing documents with the same structure, or with structures of the same kind, but it is not applicable when various kinds of document structures are mixed. This paper takes table-form documents as the objects of processing and reports a method that can recognize the document structures of various kinds of table-form documents. Table-form documents fall into various classes, with differing configurations and contents, according to their use and the adjacent relationships between item fields. To recognize the document structure accurately across these kinds, the processing method must be based on the information for each class of table-form documents. For this purpose, a classification tree is used, which hierarchically manages the information for each class of table-form documents. A structure recognition system for multiple kinds of table-form documents is realized with this framework; it includes recognition of the table-form document class, automatic acquisition of layout structure information, and recognition of the document structure.
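As a rough illustration of how a classification tree might hierarchically manage per-class information, the following minimal Python sketch descends from a root node to the most specific class whose test accepts a document. It is not the system described here: the feature names (`columns`, `header_rows`), the class names, and the contents of `layout` are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Features = Dict[str, int]  # hypothetical per-document measurements


@dataclass
class Node:
    """One class of table-form document in the tree (illustrative only)."""
    name: str
    # Test deciding whether a document belongs under this node; None accepts all.
    test: Optional[Callable[[Features], bool]] = None
    # Layout structure information stored for this class (hypothetical content).
    layout: Optional[dict] = None
    children: List["Node"] = field(default_factory=list)


def classify(node: Node, features: Features) -> Optional[Node]:
    """Return the deepest node whose test accepts the features, or None."""
    if node.test is not None and not node.test(features):
        return None
    for child in node.children:
        hit = classify(child, features)
        if hit is not None:
            return hit  # a more specific class matched
    return node  # no child matched, so this node is the best class


# A small hypothetical tree: general classes near the root, specific ones below.
root = Node("table-form", children=[
    Node("ledger-like", test=lambda f: f["columns"] >= 4, children=[
        Node("invoice",
             test=lambda f: f["header_rows"] == 1,
             layout={"item_fields": ["item", "quantity", "price", "amount"]}),
    ]),
    Node("form-like",
         test=lambda f: f["columns"] < 4,
         layout={"item_fields": ["label", "value"]}),
])

matched = classify(root, {"columns": 5, "header_rows": 1})
print(matched.name, matched.layout)
```

In this sketch, a document that matches a general class but none of its children stops at the general node, which is one plausible place where newly acquired layout information for an unseen class could be attached.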