Extraction and Integration of Statistical Data from Tables

Tables provide high-quality information because they concisely represent items, numeric values, and the relationships among them. To extract information from tables, it is necessary to recognize relationships between attributes (e.g., hierarchies), title phrases outside the tables, and null spaces in one or more consecutive row (or column) cells. Although tables represent information in a concise and understandable way, integrating multiple tables is not easy when the information becomes meaningful only after tables scattered within a single file or across different files are assembled and merged into a single table. In this paper, we focus on the extraction and integration of statistical data (i.e., numeric values) from tables. Our proposed method relies on the “ruled lines” surrounding cells in tables. We use ruled lines as a clue to obtain the “set relationship” between cells and to extract hierarchical relationships between attributes and titles outside tables. We also describe a method for integrating multiple scattered tables into a single table, and a visualization method that lets users select an arbitrary period of time and set of attributes.

Keywords: tables, table recognition, ruled lines
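
To make the extraction and integration steps concrete, the following is a minimal Python sketch, not the method of the paper itself. It assumes that ruled-line analysis has already produced a grid of header cells in which a null cell (None) continues the cell to its left, and that each table carries one row of numeric values for one period. The function names (fill_spans, attribute_paths, integrate, select) and the sample figures are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Grid = List[List[Optional[str]]]
Path = Tuple[str, ...]


def fill_spans(header_rows: Grid) -> Grid:
    """Fill null header cells from the cell on their left.

    Assumption: a null cell continues its left neighbour only when both
    cells sit under the same parent header (or are in the top row).
    """
    filled: Grid = []
    for r, row in enumerate(header_rows):
        out: List[Optional[str]] = []
        for c, cell in enumerate(row):
            if cell is None and c > 0:
                same_parent = r == 0 or filled[r - 1][c] == filled[r - 1][c - 1]
                if same_parent:
                    cell = out[c - 1]
            out.append(cell)
        filled.append(out)
    return filled


def attribute_paths(header_rows: Grid) -> List[Path]:
    """Return one hierarchical attribute path per column, top level first."""
    filled = fill_spans(header_rows)
    n_cols = len(filled[0])
    return [tuple(row[c] for row in filled if row[c]) for c in range(n_cols)]


def integrate(
    tables: Sequence[Tuple[str, Grid, Sequence[float]]]
) -> Dict[Path, Dict[str, float]]:
    """Merge scattered (period, header_rows, values) tables into one mapping."""
    merged: Dict[Path, Dict[str, float]] = {}
    for period, header_rows, values in tables:
        for path, value in zip(attribute_paths(header_rows), values):
            merged.setdefault(path, {})[period] = value
    return merged


def select(
    merged: Dict[Path, Dict[str, float]],
    attributes: Sequence[Path],
    periods: Sequence[str],
) -> Dict[Path, Dict[str, float]]:
    """Keep only the requested attribute paths and periods."""
    return {
        path: {p: v for p, v in series.items() if p in periods}
        for path, series in merged.items()
        if path in attributes
    }


# Illustrative data only: two yearly tables sharing the same header layout.
headers: Grid = [["Population", None, "Households"], ["Male", "Female", None]]
merged = integrate([
    ("2004", headers, [100.0, 110.0, 80.0]),
    ("2005", headers, [101.0, 112.0, 82.0]),
])
print(select(merged, [("Population", "Male")], ["2004", "2005"]))
```

Keying the merged data by the full attribute path rather than by the leaf label keeps columns such as "Male" under different parent headers distinct when tables from several files are combined.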
