Extracting Tabular Information From Text Files

This paper presents work on locating and extracting tables and their contents from document images. While most research in table analysis and recognition has focused on analyzing the raster image, our approach builds on advances in optical character recognition (OCR) software, which preserves the layout of tabular data by means of white space. By analyzing the geometry, syntax, and semantics of the character data, and by applying some well-known image processing techniques, we are able to 1) isolate embedded tables from documents, and 2) identify table components such as title blocks, table entries, and footer blocks. Furthermore, the table analysis techniques presented in this paper can also be applied to blocks of text isolated by traditional methods such as connected component analysis [1] or bounding boxes [2].
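As a rough illustration of how white space in OCR output can carry table layout, the sketch below finds character columns that are blank in every line of a text block and splits the lines at those gaps. It is not the method of this paper; the function names `column_gaps` and `split_cells` and the `min_width` threshold are assumptions made for the example.

```python
from typing import List, Tuple


def column_gaps(lines: List[str], min_width: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) character ranges that are blank in every line.

    Hypothetical helper: a run of at least `min_width` blank positions
    shared by all lines is treated as a column separator.
    """
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    gaps = []
    start = None
    # The trailing False closes any run that reaches the right edge.
    for i, is_blank in enumerate(blank + [False]):
        if is_blank and start is None:
            start = i
        elif not is_blank and start is not None:
            if i - start >= min_width:
                gaps.append((start, i))
            start = None
    return gaps


def split_cells(lines: List[str], min_width: int = 2) -> List[List[str]]:
    """Split each line into cells at the shared white-space gaps."""
    lines = [line.rstrip() for line in lines]
    width = max(len(line) for line in lines)
    gaps = column_gaps(lines, min_width)

    # Column spans are the stretches between consecutive gaps.
    cuts = [0] + [pos for gap in gaps for pos in gap] + [width]
    spans = [(cuts[i], cuts[i + 1]) for i in range(0, len(cuts), 2)]
    spans = [(a, b) for a, b in spans if a < b]

    return [[line.ljust(width)[a:b].strip() for a, b in spans] for line in lines]


sample = [
    "Item      Qty   Price",
    "Widget     10    2.50",
    "Gadget      3   12.00",
]
rows = split_cells(sample)
```

On the three-line sample, each line is split into three cells (item, quantity, price). A real system would first have to decide which lines belong to a table at all, which this sketch does not attempt.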

[1] Rangachar Kasturi, et al. A Robust Algorithm for Text String Separation from Mixed Text/Graphics Images, 1988, IEEE Trans. Pattern Anal. Mach. Intell.

[2] Haruhiko Kojima, et al. Table recognition for automated document entry system, 1991, Other Conferences.

[3] T. Watanabe, et al. A framework for validating recognized results in understanding table-form document images, 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[4] George Nagy, et al. Hierarchical representation of optically scanned documents, 1984.

[5] Uma Mahadevan, et al. Gap metrics for word separation in handwritten lines, 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[6] A. Laurentini, et al. Identifying and understanding tabular material in compound documents, 1992, Proceedings., 11th IAPR International Conference on Pattern Recognition. Vol.II. Conference B: Pattern Recognition Methodology and Systems.

[7] Norihiro Hagita, et al. Automated entry system for printed documents, 1990, Pattern Recognit.

[8] Rangachar Kasturi, et al. Information extraction from tabular drawings, 1994, Electronic Imaging.

[9] Zhigang Fan, et al. Tabular document recognition, 1994, Electronic Imaging.

[10] M. F., et al. Bibliography, 1985, Experimental Gerontology.

[11] Osamu Hori, et al. Robust table-form structure analysis based on box-driven reasoning, 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.