Semi-automatic Data Extraction from Tables

This paper describes a novel approach to automating the extraction of useful information from tables and recording the extracted knowledge in a structured data repository. The approach is based on modeling the behavior of an expert who collects tabular data and maps them to a predefined relational schema. Experimental results demonstrate that the proposed approach predicts expert decisions with high accuracy and thus significantly reduces the time an expert must spend on data aggregation.
