Context-Sensitive Ranking Using Cross-Domain Knowledge for Chemical Digital Libraries

Entity-centric searches are now a common way of gathering information. However, given the huge amount of available information, the entity alone is often not sufficient for finding suitable results. Users usually search for entities within a specific search context, which is important for their relevance assessment. Digital library providers therefore need to consider this search context to enable high-quality retrieval. In this paper we present an approach that enables context searches for chemical entities. Chemical entities play a major role in many domains, ranging from biomedicine and biology to materials science. Since most domain-specific documents lack suitable context annotations, we present a similarity measure that uses cross-domain knowledge gathered from Wikipedia. We show that structure-based similarity measures are not suitable for chemical context searches and introduce a similarity measure that combines entity similarity and context similarity. Our experiments show that our measure outperforms structure-based similarity measures for chemical entities. We compare against two baseline approaches: a Boolean retrieval model and a model using statistical query expansion for the context term. We evaluated the measures by computing mean average precision (MAP) over a set of queries with manual relevance assessments from domain experts. We obtained a total MAP increase of 30 percentage points (from 31% to 61%). Furthermore, we present a personalized retrieval system that leads to a further increase of around 10%.
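For readers unfamiliar with the evaluation metric named above, the following is a minimal sketch of how mean average precision is conventionally computed from ranked result lists and binary relevance judgments. It uses the standard textbook definition and is not taken from the paper; the function names, the query and document identifiers, and the example data are hypothetical.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against binary relevance judgments."""
    if not relevant:
        return 0.0  # convention chosen for this sketch: no relevant documents gives 0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document's rank
    # Relevant documents that were never retrieved contribute zero precision.
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]],
    judgments: Dict[str, Set[str]],
) -> float:
    """Mean of the per-query average precision over all queries in `rankings`."""
    if not rankings:
        return 0.0
    total = sum(
        average_precision(ranking, judgments.get(query, set()))
        for query, ranking in rankings.items()
    )
    return total / len(rankings)


# Hypothetical example: two queries with made-up document identifiers.
example_rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d4", "d5"],
}
example_judgments = {
    "q1": {"d1", "d3"},
    "q2": {"d5"},
}
example_map = mean_average_precision(example_rankings, example_judgments)
```

In this toy example the first query has an average precision of (1/1 + 2/3) / 2 and the second of 1/2, so the mean over both queries is 2/3.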
