Unsupervised wrapper induction using linked data

This work explores the use of Linked Data for Web-scale Information Extraction and shows encouraging results on the task of Wrapper Induction. We propose a simple knowledge-based method that (i) is highly flexible with respect to different domains and (ii) does not require any training material, but instead exploits Linked Data as a background knowledge source to build essential learning resources. The major contribution of this work is a study of how Linked Data, an imprecise, redundant and large-scale knowledge resource, can be used to support Web-scale Information Extraction effectively and efficiently, together with an identification of the challenges involved. We show that, for the domains it covers, Linked Data serves as a powerful knowledge resource for Information Extraction. Experiments on a publicly available dataset demonstrate that, under certain conditions, this simple unsupervised approach can achieve results competitive with complex state-of-the-art methods that depend on training data.
