Automated conversion of structured documents into SGML

Intelligent document understanding (IDU) systems convert scanned document pages into an electronic format that preserves layout and logical document structure in addition to document content. Most experimental IDU systems, however, cannot fully exploit their recognition results. In this paper we present an integrated IDU system that processes documents all the way from recognition to full utilization using the Standard Generalized Markup Language (SGML). The standardization and widespread use of SGML-based tools provide the means for closing the gap between document recognition and seamless document reuse. The conversion process involves OCR of a multipage document, document structure analysis, processing of tabular data and mathematical expressions, and generation of the final SGML description. Document structure analysis is reduced here to parsing OCR results and recreating the document structure through format analysis and fuzzy searches for standard phrases. Tabular data processing uses OCR results with positional data, horizontal lines, and heuristic rules to determine cell boundaries and contents. Recognition of mathematical expressions involves OCR on an extended symbol set and equation structure recognition via transformations on a tree representation. The transformations are applied in a fixed order: connecting separated symbols, context-sensitive OCR correction, extraction of horizontally aligned subexpressions, subscript and superscript processing, and general processing of symbols detected above or below the target symbol.
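
The fuzzy search for standard phrases mentioned above can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the phrase list, the function name `match_standard_phrase`, and the 0.8 similarity threshold are all assumptions made for illustration, and the similarity measure is the generic one provided by Python's standard `difflib` module.

```python
from difflib import SequenceMatcher

# Hypothetical list of standard phrases; not taken from the paper.
STANDARD_PHRASES = ["abstract", "introduction", "references", "acknowledgments"]


def match_standard_phrase(ocr_line, phrases=STANDARD_PHRASES, threshold=0.8):
    """Return (phrase, score) for the best fuzzy match, or None if below threshold.

    The threshold is an assumed value; a real system would tune it on OCR data.
    """
    text = ocr_line.strip().lower()
    best_phrase, best_score = None, 0.0
    for phrase in phrases:
        # ratio() is a similarity in [0, 1] that tolerates small OCR errors.
        score = SequenceMatcher(None, text, phrase).ratio()
        if score > best_score:
            best_phrase, best_score = phrase, score
    if best_score >= threshold:
        return best_phrase, best_score
    return None


if __name__ == "__main__":
    # Typical OCR confusions: "t" read as "l", "I" read as "1".
    for line in ["Abstracl", "1ntroduction", "The quick brown fox"]:
        print(repr(line), "->", match_standard_phrase(line))
```

A line that matches a standard phrase could then be tagged as the start of the corresponding logical element in the SGML output, while unmatched lines would be left to format analysis.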
