Table structure recognition based on robust block segmentation

This paper presents an efficient approach to identifying tabular structures in both electronic and paper documents. The resulting T-Recs system takes word bounding box information as input and outputs the corresponding logical text block units. Starting with an arbitrary word as the block seed, the algorithm recursively expands the block to include all words that interleave with their vertical neighbors. Since even the smallest gaps between table columns prevent their words from interleaving with one another, this initial segmentation is able to identify and isolate such columns. To deal with inherent segmentation errors caused by isolated lines, overhanging words, or cells spanning more than one column, a series of postprocessing steps is added. These steps benefit from a very simple distinction between type 1 and type 2 blocks: type 1 blocks contain at most one word per line; all others are of type 2. This distinction allows heuristics to be applied selectively to each group of blocks. The joint decomposition of column blocks into subsets of table cells leads to a final block segmentation at a homogeneous level of abstraction. These segments feed the final layout analysis, which identifies table environments and cells that stretch over several rows and/or columns.
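The initial segmentation and the type 1 / type 2 distinction can be illustrated with a minimal sketch. It is not the T-Recs implementation: the names `Word`, `overlaps`, `segment_blocks`, and `block_type` are hypothetical, each word is assumed to already carry a text-line index, and "interleaving" is read here as a strict overlap of horizontal bounding box extents between words on directly adjacent lines. The expansion is written iteratively rather than recursively to avoid deep call stacks, and none of the postprocessing steps are covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    line: int    # index of the text line (assumed to be known in advance)
    x0: float    # left edge of the word bounding box
    x1: float    # right edge of the word bounding box


def overlaps(a: Word, b: Word) -> bool:
    """Assumed interleaving test: horizontal extents strictly overlap."""
    return a.x0 < b.x1 and b.x0 < a.x1


def segment_blocks(words):
    """Group word indices into blocks by seed expansion.

    Each unassigned word in turn serves as a seed. The block grows by
    adding every word on the line directly above or below a block member
    whose horizontal extent overlaps that member.
    """
    by_line = {}
    for i, w in enumerate(words):
        by_line.setdefault(w.line, []).append(i)

    unassigned = set(range(len(words)))
    blocks = []
    for seed in range(len(words)):
        if seed not in unassigned:
            continue
        unassigned.discard(seed)
        block, stack = [seed], [seed]
        while stack:
            cur = stack.pop()
            for neighbor_line in (words[cur].line - 1, words[cur].line + 1):
                for j in by_line.get(neighbor_line, []):
                    if j in unassigned and overlaps(words[cur], words[j]):
                        unassigned.discard(j)
                        block.append(j)
                        stack.append(j)
        blocks.append(sorted(block))
    return blocks


def block_type(words, block):
    """Return 1 if the block has at most one word per line, else 2."""
    per_line = {}
    for i in block:
        per_line[words[i].line] = per_line.get(words[i].line, 0) + 1
    return 1 if max(per_line.values()) <= 1 else 2
```

Under these assumptions, a two-column table whose columns are separated by a horizontal gap yields one type 1 block per column, because no word in one column overlaps a vertical neighbor in the other. Lines of running text, whose words overlap their neighbors above and below, merge into a single type 2 block.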
