Document understanding system using stochastic context-free grammars

We present a document understanding system in which the arrangement of lines of text and block separators within a document is modeled by stochastic context-free grammars. A grammar corresponds to a document genre; our system may be adapted to a new genre simply by replacing the input grammar. The system incorporates an optical character recognition system that outputs characters together with their positions and font sizes. These features are combined to form a document representation consisting of lines of text and separators. Lines of text are labeled as tokens using regular expression matching. The maximum likelihood parse of this stream of tokens and separators yields a functional labeling of the document lines. We describe business card and business letter applications.
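The pipeline described above (regular-expression tokenization of lines, followed by a maximum likelihood parse under a stochastic context-free grammar) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the system itself: the token names (ALPHA, PHONE, EMAIL, MIXED, SEP), the regular expressions, the toy business-card grammar in Chomsky normal form, and its rule probabilities are all hypothetical. The parser is a standard Viterbi variant of the CYK algorithm, which may differ from the parsing method actually used.

```python
import re
from collections import defaultdict

# Hypothetical line-level token patterns, tried in order (not from the paper).
TOKEN_PATTERNS = [
    ("EMAIL", re.compile(r"^\S+@\S+\.\S+$")),
    ("PHONE", re.compile(r"^[\d\s()+.-]{7,}$")),
    ("ALPHA", re.compile(r"^[A-Za-z .'-]+$")),
]

def tokenize(lines):
    """Map each text line to a token; None stands for a block separator."""
    tokens = []
    for line in lines:
        if line is None:
            tokens.append("SEP")
            continue
        text = line.strip()
        tokens.append(next((name for name, pat in TOKEN_PATTERNS
                            if pat.match(text)), "MIXED"))
    return tokens

# Toy grammar in Chomsky normal form; probabilities are illustrative only.
# Lexical rules: (nonterminal, token) -> probability.
LEXICAL = {
    ("Name", "ALPHA"): 1.0, ("Title", "ALPHA"): 1.0,
    ("Org", "ALPHA"): 0.6, ("Org", "MIXED"): 0.4,
    ("Phone", "PHONE"): 1.0, ("Email", "EMAIL"): 1.0, ("Sep", "SEP"): 1.0,
}
# Binary rules: (parent, left child, right child) -> probability.
BINARY = {
    ("Card", "Ident", "Contact"): 1.0,
    ("Ident", "Name", "Title"): 0.7, ("Ident", "Name", "Org"): 0.3,
    ("Contact", "Phone", "Email"): 0.5, ("Contact", "Email", "Phone"): 0.2,
    ("Contact", "Sep", "Contact"): 0.3,
}

def viterbi_parse(tokens, start="Card"):
    """Return (probability, per-line labels) of the most likely parse."""
    n = len(tokens)
    best = defaultdict(dict)  # best[(i, j)][symbol] = (prob, backpointer)
    for i, tok in enumerate(tokens):
        for (lhs, t), p in LEXICAL.items():
            if t == tok and p > best[(i, i + 1)].get(lhs, (0.0, None))[0]:
                best[(i, i + 1)][lhs] = (p, None)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (lhs, left, right), p in BINARY.items():
                    if left in best[(i, k)] and right in best[(k, j)]:
                        score = p * best[(i, k)][left][0] * best[(k, j)][right][0]
                        if score > best[(i, j)].get(lhs, (0.0, None))[0]:
                            best[(i, j)][lhs] = (score, (k, left, right))
    if start not in best[(0, n)]:
        return 0.0, []  # the grammar does not accept this token stream
    labels = [None] * n

    def walk(i, j, sym):
        # Follow backpointers; preterminals give each line its functional label.
        back = best[(i, j)][sym][1]
        if back is None:
            labels[i] = sym
        else:
            k, left, right = back
            walk(i, k, left)
            walk(k, j, right)

    walk(0, n, start)
    return best[(0, n)][start][0], labels

card = ["Jane Doe", "Senior Engineer", None,
        "+1 (555) 010-0199", "jane.doe@example.com"]
tokens = tokenize(card)
prob, labels = viterbi_parse(tokens)
print(tokens)
print(labels, prob)
```

In this sketch the first two lines receive the same token (ALPHA), so the regular expressions alone cannot tell a name from a title; the grammar resolves the ambiguity by preferring the more probable structure. Adapting the sketch to another genre would amount to swapping the two rule tables, which mirrors the idea of replacing the input grammar.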
