Early Steps Towards Web Scale Information Extraction with LODIE

Information extraction (IE) is the task of transforming unstructured textual data into a structured representation that machines can process. The exponential growth of the Web produces a quantity of data for which automatic knowledge capture is essential. This work describes the methodology for web-scale information extraction in the LODIE project (Linked Open Data Information Extraction) and highlights results from early experiments carried out in the initial phase of the project. LODIE aims to develop IE techniques that scale to the Web and adapt to user information needs. The core idea behind LODIE is to exploit linked open data, a very large-scale information resource that provides annotated data on a growing number of domains, as the foundation for IE. This article has two objectives: first, to describe the LODIE project as a whole and outline its general challenges and directions; second, to describe some initial steps taken towards the general solution, focusing on a specific IE subtask, wrapper induction.
