Effective and efficient Semantic Table Interpretation using TableMiner+

This article introduces TableMiner+, a Semantic Table Interpretation method that annotates Web tables both effectively and efficiently. Built on our previous work, TableMiner, the extended version advances the state of the art in several ways. First, it improves annotation accuracy by using various types of contextual information, both inside and outside tables, as features for inference. Second, it reduces computational overhead by adopting an incremental, bootstrapping approach: it starts by creating preliminary, partial annotations of a table from ‘sample’ data in the table, then uses the outcome as a ‘seed’ to guide interpretation of the remaining content. A message passing process then iteratively refines the results over the entire table to create the final optimal annotations. Third, it handles all annotation tasks of Semantic Table Interpretation (e.g., annotating a column or entity cells), whereas state-of-the-art methods are each limited in different ways. We also compile the largest dataset known to date and extensively evaluate TableMiner+ against four baselines and two re-implemented state-of-the-art methods (near-identical re-implementations, since adaptations are needed to accommodate different knowledge bases). TableMiner+ consistently outperforms all models under all experimental settings. On the two most diverse datasets, which cover multiple domains and various table schemata, it improves F1 by between 1 and 42 percentage points, depending on the annotation task. It also significantly reduces computational overhead, measured in wall-clock time, compared with classic methods that ‘exhaustively’ process the entire table content to build features for inference. As a concrete example, compared against a joint-inference method implemented with parallel computation, the non-parallel implementation of TableMiner+ achieves a significant improvement in learning accuracy and almost orders-of-magnitude savings in wall-clock time.
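
The sample-then-seed idea can be illustrated with a small toy sketch. It is not the TableMiner+ algorithm itself: the knowledge base `TOY_KB`, the function names, and the `patience` stopping rule are all hypothetical stand-ins chosen for illustration, and the message passing refinement step is omitted. The sketch reads cells of one column incrementally, stops as soon as the leading column type looks stable, and then uses that type as a seed to disambiguate every cell.

```python
from collections import Counter

# Hypothetical toy knowledge base: cell text -> candidate (entity_id, type) pairs.
TOY_KB = {
    "Paris": [("city/paris", "City"), ("person/paris_hilton", "Person")],
    "London": [("city/london", "City")],
    "Berlin": [("city/berlin", "City")],
    "Madrid": [("city/madrid", "City")],
    "Victoria": [("person/queen_victoria", "Person"), ("city/victoria_bc", "City")],
}


def preliminary_column_type(cells, patience=2):
    """Vote on a column type cell by cell, stopping early once the leading
    type has stayed unchanged for `patience` consecutive cells (an assumed,
    simplistic convergence test). Returns (type, number_of_cells_read)."""
    votes = Counter()
    leader, stable, used = None, 0, 0
    for used, cell in enumerate(cells, start=1):
        for _, type_ in TOY_KB.get(cell, []):
            votes[type_] += 1
        if not votes:
            continue  # no evidence yet; keep reading
        top = votes.most_common(1)[0][0]
        if top == leader:
            stable += 1
        else:
            leader, stable = top, 0
        if stable >= patience:
            break  # the sample is enough; skip the remaining cells
    return leader, used


def annotate_column(cells):
    """Use the preliminary type as a seed to pick one entity per cell."""
    seed_type, _ = preliminary_column_type(cells)
    annotations = {}
    for cell in cells:
        candidates = TOY_KB.get(cell, [])
        # Prefer candidates that agree with the seed type; otherwise fall back.
        preferred = [entity for entity, type_ in candidates if type_ == seed_type]
        fallback = [entity for entity, _ in candidates]
        chosen = preferred or fallback
        annotations[cell] = chosen[0] if chosen else None
    return seed_type, annotations


column = ["Paris", "London", "Berlin", "Madrid", "Victoria"]
column_type, cell_annotations = annotate_column(column)
```

In this toy run the column type is settled after reading only part of the column, and the ambiguous cells are then resolved in favor of candidates that agree with it, which is the intuition behind avoiding exhaustive processing of the whole table.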
