Data Extraction from Web Tables: The Devil is in the Details

We present a method based on header paths for efficient and complete extraction of labeled data from tables meant for humans. Although many table configurations yield to the proposed syntactic analysis, some require access to semantic knowledge. Clicking on one or two critical cells per table, through a simple interface, is sufficient to resolve most of these problem tables. Header paths, a purely syntactic representation of visual tables, can be transformed ("factored") into existing representations of structured data such as category trees, relational tables, and RDF triples. From a random sample of 200 web tables from ten large statistical web sites, we generated 376 relational tables and 34,110 subject-predicate-object RDF triples.
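As a rough illustration of how header paths might be factored into relational rows and RDF-style triples, the following Python sketch may help. It is not the authors' algorithm. The toy table, its labels and values, the function names `header_paths_to_triples` and `header_paths_to_rows`, and the choice of row path as subject and column path as predicate are all assumptions made for this example.

```python
from itertools import product

# Hypothetical toy table (invented labels and values, not from the paper).
# Each data cell is addressed by a row header path and a column header path,
# written as tuples of header labels from outermost to innermost.
row_paths = [("Region", "North"), ("Region", "South")]
col_paths = [("Sales", "2008"), ("Sales", "2009")]
values = [[10, 20],
          [30, 40]]


def header_paths_to_triples(row_paths, col_paths, values):
    """Emit one (subject, predicate, object) triple per data cell.

    Assumption for illustration: the joined row path is the subject and
    the joined column path is the predicate.
    """
    triples = []
    for (i, rp), (j, cp) in product(enumerate(row_paths), enumerate(col_paths)):
        triples.append(("/".join(rp), "/".join(cp), values[i][j]))
    return triples


def header_paths_to_rows(row_paths, col_paths, values):
    """Flatten the table into relational rows: one column per header level,
    followed by the cell value."""
    rows = []
    for (i, rp), (j, cp) in product(enumerate(row_paths), enumerate(col_paths)):
        rows.append((*rp, *cp, values[i][j]))
    return rows


triples = header_paths_to_triples(row_paths, col_paths, values)
rows = header_paths_to_rows(row_paths, col_paths, values)
```

In this sketch every data cell yields exactly one triple and one relational row, so a table with two row paths and two column paths produces four of each. Real web tables would first need their header regions identified, which is the step where the abstract notes that semantic knowledge or a user click may be required.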
