Using linked data to mine RDF from Wikipedia's tables

The tables embedded in Wikipedia articles contain rich, semi-structured encyclopaedic content. However, the cumulative content of these tables cannot be queried directly. We thus propose methods to recover the semantics of Wikipedia tables and, in particular, to extract facts from them in the form of RDF triples. Our core method uses an existing Linked Data knowledge base to find pre-existing relations between entities in Wikipedia tables, and then suggests that the same relations hold for other entities in the same columns on different rows. We find that this approach extracts RDF triples from Wikipedia's tables at a raw precision of 40%. To improve on the raw precision, we define a set of features for extracted triples that are tracked during the extraction phase. Using a manually labelled gold standard, we then test a variety of machine learning methods for classifying triples as correct or incorrect. One such method extracts 7.9 million unique and novel RDF triples from over one million Wikipedia tables at an estimated precision of 81.5%.
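The core idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `suggest_triples`, the `ex:` identifiers, the in-memory knowledge base, and the use of a simple support count as a feature are all assumptions made for illustration. The sketch assumes each table cell has already been linked to an entity identifier (or `None` when no entity was found).

```python
from collections import Counter
from itertools import permutations


def suggest_triples(rows, kb):
    """Suggest new (subject, predicate, object) triples for a table.

    rows: list of rows, each a list of entity ids (None for unlinked cells).
    kb:   set of known (subject, predicate, object) triples (toy knowledge base).
    Returns a dict mapping each suggested triple to the number of rows in
    which the knowledge base already holds that predicate for the same
    ordered column pair (a hypothetical "support" feature).
    """
    # Index the knowledge base by (subject, object) for fast lookup.
    by_pair = {}
    for s, p, o in kb:
        by_pair.setdefault((s, o), set()).add(p)

    n_cols = max((len(r) for r in rows), default=0)
    suggestions = {}

    # Consider every ordered pair of columns (subject column, object column).
    for i, j in permutations(range(n_cols), 2):
        # Keep only rows where both cells are linked to an entity.
        pairs = [
            (r[i], r[j])
            for r in rows
            if i < len(r) and j < len(r) and r[i] and r[j]
        ]

        # Count how many rows already have each predicate in the knowledge base.
        support = Counter()
        for s, o in pairs:
            for p in by_pair.get((s, o), ()):
                support[p] += 1

        # Propose the same predicate for rows where it is not yet known.
        for p, count in support.items():
            for s, o in pairs:
                triple = (s, p, o)
                if triple not in kb:
                    suggestions[triple] = max(count, suggestions.get(triple, 0))

    return suggestions


# Toy data: all identifiers below are invented for this example.
kb = {
    ("ex:Alice", "ex:playsFor", "ex:TeamX"),
    ("ex:Bob", "ex:playsFor", "ex:TeamX"),
}
rows = [
    ["ex:Alice", "ex:TeamX"],
    ["ex:Bob", "ex:TeamX"],
    ["ex:Carol", "ex:TeamY"],
]
suggestions = suggest_triples(rows, kb)
```

Here the relation `ex:playsFor` is already known for two rows, so it is suggested for the third row with a support of 2. A count like this is one example of a feature that could be passed to a classifier to separate correct from incorrect triples; the actual feature set used in the paper is not specified in this abstract.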
