Paper Information - End-to-End Conversion of HTML Tables for Populating a Relational Database

Automating the conversion of human-readable HTML tables into machine-readable relational tables will enable end-user query processing of the millions of data tables found on the web. Theoretically sound and experimentally successful methods for index-based segmentation, extraction of category hierarchies, and construction of a canonical table suitable for direct input to a relational database are demonstrated on 200 heterogeneous web tables. The methods are scalable: the program generates 198 Access-compatible CSV files at about 0.1 s per table (the remaining two tables could not be indexed).
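To make the idea of a canonical table concrete, the following is a minimal sketch, not the paper's program. It assumes a table has already been segmented into row headers, column headers, and a grid of value cells, and it flattens that grid into one record per value cell, written as CSV that a relational database could import. The function names (`to_canonical_rows`, `to_csv`), the field names, and the sample data are hypothetical and chosen only for illustration.

```python
import csv
import io


def to_canonical_rows(row_headers, col_headers, values):
    """Flatten a segmented table into one record per value cell.

    Each record pairs a value with the row and column header that index it.
    (Hypothetical helper; the paper's actual procedure is not shown here.)
    """
    records = []
    for r, row_label in enumerate(row_headers):
        for c, col_label in enumerate(col_headers):
            records.append((row_label, col_label, values[r][c]))
    return records


def to_csv(records, field_names=("RowCategory", "ColumnCategory", "Value")):
    """Serialize the records as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(field_names)
    writer.writerows(records)
    return buffer.getvalue()


# Illustrative data only: a 2 x 2 table with one header row and one header column.
row_headers = ["2010", "2011"]
col_headers = ["Exports", "Imports"]
values = [[10, 7], [12, 9]]

records = to_canonical_rows(row_headers, col_headers, values)
csv_text = to_csv(records)
print(csv_text)
```

A real web table may have multi-level headers, in which case each record would carry a path through the category hierarchy rather than a single label per dimension; this sketch deliberately leaves that out.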

[1] Jayant Madhavan, et al. Recovering Semantics of Tables on the Web, 2011, Proc. VLDB Endow.

[2] Carina F. Dorneles, et al. Web table taxonomy and formalization, 2013, SGMD.

[3] George Nagy, et al. Segmenting Tables via Indexing of Value Cells by Table Headers, 2013, 2013 12th International Conference on Document Analysis and Recognition.

[4] Daniel P. Lopresti, et al. A Tabular Survey of Automated Table Processing, 1999, GREC.

[5] Jeffrey D. Ullman, et al. Principles of Database Systems, 1980.

[6] David W. Embley, et al. Semantically Conceptualizing and Annotating Tables, 2008, ASWC.

[7] Sunita Sarawagi, et al. Annotating and searching web tables using entities, types and relationships, 2010, Proc. VLDB Endow.

[8] David W. Embley, et al. Factoring web tables, 2011, IEA/AIE'11.

[9] George Nagy, et al. VeriClick: an efficient tool for table format verification, 2011, Electronic Imaging.

[10] Yalin Wang, et al. Detecting Tables in HTML Documents, 2002, Document Analysis Systems.

[11] Xinxin Wang, et al. Tabular Abstraction, Editing, and Formatting, 1996.

[12] George Nagy. Learning the characteristics of critical cells from web tables, 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[13] David W. Embley, et al. Data Extraction from Web Tables: The Devil is in the Details, 2011, 2011 International Conference on Document Analysis and Recognition.

[14] Wolfgang Gatterbauer, et al. Towards domain-independent information extraction from web tables, 2007, WWW '07.

[15] Luís Torgo, et al. Design of an end-to-end method to extract information from tables, 2006, International Journal of Document Analysis and Recognition (IJDAR).

[16] Stefano Ferilli, et al. Finding Critical Cells in Web Tables with SRL: Trying to Uncover the Devil's Tease, 2013, 2013 12th International Conference on Document Analysis and Recognition.

[17] Zhi Tang, et al. Table Header Detection and Classification, 2012, AAAI.

[18] George Nagy, et al. From Tessellations to Table Interpretation, 2009, Calculemus/MKM.

[19] Hanan Samet, et al. Schema Extraction for Tabular Data on the Web, 2013, Proc. VLDB Endow.

[20] 関由紀子, et al. Microsoft Access 2010, 2011.

[21] Kun Bai, et al. TableSeer: automatic table metadata extraction and searching in digital libraries, 2007, JCDL '07.

[22] Ravi Kumar, et al. A web of concepts, 2009, PODS.

[23] David W. Embley, et al. Table-processing paradigms: a research survey, 2006, International Journal of Document Analysis and Recognition (IJDAR).