Combining inverted indices and structured search for ad-hoc object retrieval

Retrieving semi-structured entities to answer keyword queries is an increasingly important feature of many modern Web applications. The fast-growing Linked Open Data (LOD) movement makes it possible to crawl and index very large amounts of structured data describing hundreds of millions of entities. However, entity retrieval approaches have yet to find efficient and effective ways of ranking and navigating through such large data sets. In this paper, we address the problem of Ad-hoc Object Retrieval over large-scale LOD data by proposing a hybrid approach that combines IR and structured search techniques. Specifically, we propose an architecture that uses an inverted index to answer keyword queries and a semi-structured database to improve search effectiveness by automatically generating queries over the LOD graph. Experimental results show that our ranking algorithms, which exploit both IR and graph indices, outperform state-of-the-art entity retrieval techniques, with improvements of up to 25% over the BM25 baseline.
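To make the hybrid idea concrete, the following is a minimal sketch in Python, not the architecture evaluated in the paper. It assumes a toy in-memory inverted index scored with a standard BM25 formula, and a plain adjacency list standing in for the LOD graph and the structured queries generated over it. The entity data, the edge list, and the names `build_index`, `bm25`, `hybrid_rank`, `top_k`, and `weight` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy entity descriptions (hypothetical data, not taken from the paper).
ENTITIES = {
    "e1": "zurich city in switzerland",
    "e2": "university of zurich",
    "e3": "eth research institute",
}
# Toy graph edges between entities (hypothetical stand-in for LOD links).
EDGES = {"e2": ["e3"]}


def build_index(docs):
    """Build an inverted index: term -> {entity_id: term frequency}."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = text.split()
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths


def bm25(query, index, lengths, k1=1.2, b=0.75):
    """Score entities for a keyword query with a standard BM25 formula."""
    n = len(lengths)
    avg_len = sum(lengths.values()) / n
    scores = defaultdict(float)
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue
        df = len(postings)
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return dict(scores)


def hybrid_rank(query, index, lengths, edges, top_k=2, weight=0.3):
    """Combine keyword scores with scores propagated along graph edges.

    Assumption: neighbours of the top_k keyword hits receive a fixed
    fraction (weight) of the hit's score. This is an illustrative
    combination rule, not the ranking function from the paper.
    """
    ir_scores = bm25(query, index, lengths)
    seeds = sorted(ir_scores, key=ir_scores.get, reverse=True)[:top_k]
    final = dict(ir_scores)
    for seed in seeds:
        for neighbour in edges.get(seed, []):
            final[neighbour] = final.get(neighbour, 0.0) + weight * ir_scores[seed]
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)


index, lengths = build_index(ENTITIES)
for entity_id, score in hybrid_rank("zurich university", index, lengths, EDGES):
    print(entity_id, round(score, 3))
```

In this toy run, `e3` contains neither query term, so the inverted index alone never returns it; it enters the ranking only through its link from `e2`. That is the kind of gain a graph-based step can add on top of keyword retrieval, under the simplifying assumptions stated above.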
