Tabular document recognition

In this paper, we propose an efficient algorithm for recognizing the grid structure within a tabular document. The algorithm has two parts: first, a row labeling algorithm groups similar rows into clusters; then, a column labeling algorithm identifies the column structure within each cluster. Each column structure is identified by a set of column separation intervals, which are computed from the intervals representing the extent of the white space between consecutive word fragments. We formally describe a method for deriving column separation intervals from these word fragment separation intervals. The method constructs the closure of a set of line intervals under the operation of intersection. The closure is maintained dynamically in a data structure that gives easy access to its elements. This technique is computationally less expensive than projection and search at the pixel level, since word fragment acquisition is already required for document recognition applications.
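The closure construction can be illustrated with a minimal Python sketch. It is not the paper's data structure. It assumes closed integer intervals `(start, end)` along the horizontal axis and one list of white-space gaps per row in a cluster. The names `intersect`, `intersection_closure`, and `column_separators` are hypothetical. The selection rule, which keeps the widest closure elements that lie inside some gap of every row, is an assumption about how separators might be read off the closure.

```python
from itertools import chain

# Closed interval on the horizontal axis: (start, end) with start <= end.
Interval = tuple[int, int]


def intersect(a: Interval, b: Interval):
    """Return the intersection of two closed intervals, or None if disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def intersection_closure(intervals) -> set:
    """Build the closure of a set of intervals under pairwise intersection.

    Intervals are inserted one at a time. Because the set built so far is
    already closed, intersecting the new interval with each existing element
    is enough to keep the set closed after the insertion.
    """
    closure: set = set()
    for iv in intervals:
        added = {iv}
        for existing in closure:
            cut = intersect(iv, existing)
            if cut is not None:
                added.add(cut)
        closure |= added
    return closure


def contains(outer: Interval, inner: Interval) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def column_separators(rows) -> list:
    """Assumed selection rule: keep closure elements that fall inside a gap
    of every row, then drop any that are nested in a wider candidate."""
    closure = intersection_closure(chain.from_iterable(rows))
    candidates = [
        iv for iv in closure
        if all(any(contains(gap, iv) for gap in row) for row in rows)
    ]
    maximal = [
        iv for iv in candidates
        if not any(other != iv and contains(other, iv) for other in candidates)
    ]
    return sorted(maximal)
```

Given three rows whose gaps are `[(10, 20), (40, 55)]`, `[(12, 25), (45, 50)]`, and `[(8, 18), (42, 60)]`, the sketch yields the separators `(12, 18)` and `(45, 50)`, the white-space bands shared by all three rows. It works only on fragment gaps and never touches pixels, which is the source of the cost advantage the abstract describes.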
