Bridging the Gap - Using External Knowledge Bases for Context-Aware Document Retrieval

Today, a vast amount of information is available on the Web as unstructured text indexed by Web search engines. However, especially for searches on abstract concepts or context terms, a simple keyword-based Web search may compromise retrieval quality, because the query terms may not occur directly in the texts (the vocabulary problem). The state-of-the-art solution is query expansion, which increases recall but often also causes a steep decrease in retrieval precision. This decrease is a severe problem for digital library providers, for whom high-quality retrieval that meets current standards is vital. In this paper we present an approach that supports even abstract context searches (conceptual queries) with high retrieval quality by using Wikipedia to semantically bridge the gap between query terms and textual content. Rather than expanding queries, we extract the most important terms from each text document in a focused Web collection and enrich them with features gathered from Wikipedia. These enriched terms are then used to compute the relevance of a document with respect to a conceptual query. The evaluation shows significant improvements over query expansion approaches: the overall retrieval quality is increased up to 74.5% in mean average precision.
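The document-side pipeline described above (extract important terms, enrich them, score against a conceptual query) can be illustrated with a minimal sketch. It is not the authors' implementation: TF-IDF is assumed here as the measure of term importance, the dictionary `TOY_WIKI_FEATURES` is a hypothetical stand-in for features gathered from Wikipedia, and the overlap-based `relevance` score is likewise an assumption made only for illustration.

```python
import math
from collections import Counter

# Hypothetical stand-in for features gathered from Wikipedia
# (for example, categories of the article describing a term).
TOY_WIKI_FEATURES = {
    "benzene": {"aromatic compound", "organic chemistry"},
    "ethanol": {"alcohol", "organic chemistry"},
    "photosynthesis": {"plant physiology", "biology"},
}

# Toy collection; note that "organic chemistry" occurs in neither text.
DOCS = [
    "Benzene reacts with ethanol. Benzene is stable.",
    "Photosynthesis in leaves. Photosynthesis needs light.",
]


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    tokens = (t.strip(".,;:()").lower() for t in text.split())
    return [t for t in tokens if t]


def top_terms(docs, k=2):
    """Return the k highest-scoring TF-IDF terms of each document
    (assumed importance measure; ties are broken alphabetically)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    result = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:k])
    return result


def relevance(query, terms):
    """Fraction of a document's terms whose features contain the query."""
    if not terms:
        return 0.0
    q = query.lower()
    hits = sum(1 for t in terms if q in TOY_WIKI_FEATURES.get(t, set()))
    return hits / len(terms)


def rank(query, docs, k=2):
    """Return (document index, score) pairs, best match first."""
    scored = [(i, relevance(query, terms))
              for i, terms in enumerate(top_terms(docs, k))]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

In this toy setting, `rank("organic chemistry", DOCS)` places the first document ahead of the second even though the query string appears in neither text, which is the vocabulary problem the enrichment step is meant to address. The query itself is left unexpanded; only the document representation changes.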
