Finding Good URLs: Aligning Entities in Knowledge Bases with Public Web Document Representations

In this paper we address the novel task of mapping entities from a knowledge base to public web documents. The task is relevant for aligning structured data with web documents, e.g., to provide equivalent human-readable representations of entities or to detect changes on the web and propagate them to the knowledge base. An alternative interpretation of the task is finding good public URLs for the entities in a knowledge base. To address the task, we adapt and investigate several approaches based on web search and link network analysis. We compare nine approaches, including ordinary web search for the text label of an entity as well as link analysis strategies such as HITS authority ranking and PageRank. We evaluate the approaches by how well they identify URLs of documents that are good representations of a given entity. Overall, our experiments show a significant advantage of label-based web search over all other methods. Furthermore, we introduce a filtering technique that leverages semantic typing to boost the performance of virtually all methods.
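To make the link analysis idea concrete, the following is a minimal sketch of HITS authority ranking over a small link graph of candidate pages. It is a generic textbook formulation, not the configuration used in the paper; the function names, the example URLs, and the fixed iteration count are assumptions made purely for illustration.

```python
from math import sqrt


def hits_authority(links, iterations=50):
    """Return HITS authority scores for a link graph.

    `links` maps a source URL to the list of URLs it links to.
    The iteration count is an illustrative assumption, not a tuned value.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking to a node.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for tgt in targets:
                new_auth[tgt] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}

        # Hub update: sum the authority scores of all pages a node links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for tgt in targets:
                new_hub[src] += auth[tgt]
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}

    return auth


def rank_candidates(links):
    """Order all URLs in the graph by descending authority (ties by URL)."""
    auth = hits_authority(links)
    return sorted(auth, key=lambda url: (-auth[url], url))


# Hypothetical link graph: three pages link to one candidate entity page.
example_links = {
    "http://example.org/a": ["http://example.org/entity", "http://example.org/other"],
    "http://example.org/b": ["http://example.org/entity"],
    "http://example.org/c": ["http://example.org/entity"],
}

if __name__ == "__main__":
    for url in rank_candidates(example_links):
        print(url)
```

In this toy graph the page with the most in-links from good hubs is ranked first. A label-based web search baseline would instead rank pages by how well they match the entity's text label, without using the link structure at all.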
