Understanding multi-articled documents
A document understanding method based on a tree representation of document structure is proposed. Documents are shown to have a clear hierarchical geometric structure, which can be represented as a tree. A small number of rules transform this geometric structure into the logical structure, which represents the semantics. The virtual field separator technique is used to exploit the information carried by special document constituents such as field separators and frames, while keeping the number of transformation rules small. Experiments on a variety of document formats show that the proposed method applies to most documents commonly encountered in daily use, although the transformation rules leave room for further refinement.
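To make the idea of a geometric tree and a rule-based transformation concrete, the following is a minimal sketch and not the paper's actual rule set. The class names `GeoNode` and `LogicalNode`, the `line_height` feature, and the single rule used here (a nested block becomes an "article"; the topmost leaf with the tallest text becomes its "title"; other leaves become "body") are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeoNode:
    """A block in the geometric tree: a bounding box plus nested sub-blocks."""
    top: int
    left: int
    bottom: int
    right: int
    line_height: int = 0  # hypothetical feature: text line height in pixels
    children: List["GeoNode"] = field(default_factory=list)


@dataclass
class LogicalNode:
    """A node in the logical tree, carrying a semantic label."""
    label: str
    children: List["LogicalNode"] = field(default_factory=list)


def to_logical(node: GeoNode, label: str = "document") -> LogicalNode:
    """Map a geometric tree to a logical tree with one illustrative rule."""
    result = LogicalNode(label)
    # Visit sub-blocks in reading order: top to bottom, then left to right.
    ordered = sorted(node.children, key=lambda c: (c.top, c.left))
    for i, child in enumerate(ordered):
        if child.children:
            # Assumed rule: a block that nests further blocks is an article.
            child_label = "article"
        elif (i == 0 and len(ordered) > 1
              and child.line_height > max(c.line_height for c in ordered[1:])):
            # Assumed rule: the topmost leaf with the tallest text is a title.
            child_label = "title"
        else:
            child_label = "body"
        result.children.append(to_logical(child, child_label))
    return result


# A page holding two articles; the second has a two-column body.
page = GeoNode(0, 0, 1000, 800, children=[
    GeoNode(0, 0, 480, 800, children=[
        GeoNode(0, 0, 60, 800, line_height=40),
        GeoNode(70, 0, 480, 800, line_height=12),
    ]),
    GeoNode(500, 0, 1000, 800, children=[
        GeoNode(500, 0, 550, 800, line_height=36),
        GeoNode(560, 0, 1000, 390, line_height=12),
        GeoNode(560, 410, 1000, 800, line_height=12),
    ]),
])

tree = to_logical(page)
```

Here `tree` is a `document` node with two `article` children, and the second article holds a `title` followed by two `body` blocks. A real system would derive the geometric tree from image segmentation and would need further rules, for example for field separators and frames, which this sketch does not model.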