End-to-End Conversion of HTML Tables for Populating a Relational Database

Automating the conversion of human-readable HTML tables into machine-readable relational tables will enable end-user query processing of the millions of data tables found on the web. Theoretically sound and experimentally successful methods for index-based segmentation, extraction of category hierarchies, and construction of a canonical table suitable for direct input to a relational database are demonstrated on 200 heterogeneous web tables. The methods are scalable: the program generates 198 Access-compatible CSV files at about 0.1 s per table (the remaining two of the 200 tables could not be indexed).
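To make the idea of a canonical table concrete, the following is a minimal Python sketch, not the authors' program. It assumes a rectangular table with no `rowspan` or `colspan`, and it takes the header extent as given (the hypothetical parameters `n_header_rows` and `n_header_cols`) instead of finding it by index-based segmentation. Each value cell is written as one CSV row keyed by its row-header and column-header paths; the names `TableGrid` and `to_canonical_csv` are illustrative only.

```python
import csv
import io
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Collects the text of each <td>/<th> cell into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def to_canonical_csv(html, n_header_rows=1, n_header_cols=1):
    """Return CSV text with one line per value cell.

    Assumption: the header extent is supplied by the caller; the paper
    derives it automatically, which this sketch does not attempt.
    """
    parser = TableGrid()
    parser.feed(html)
    parser.close()
    grid = parser.rows

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["row_header", "column_header", "value"])
    for row in grid[n_header_rows:]:
        row_path = " / ".join(row[:n_header_cols])
        for j in range(n_header_cols, len(row)):
            # Join the stacked column-header cells above this value cell.
            col_path = " / ".join(grid[i][j] for i in range(n_header_rows))
            writer.writerow([row_path, col_path, row[j]])
    return out.getvalue()


if __name__ == "__main__":
    sample = (
        "<table>"
        "<tr><th></th><th>2010</th><th>2011</th></tr>"
        "<tr><th>Norway</th><td>4.9</td><td>5.0</td></tr>"
        "</table>"
    )
    print(to_canonical_csv(sample))
```

A file in this one-value-per-row layout has a fixed set of columns regardless of the shape of the source table, which is what makes it straightforward to import into a relational database.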
