PageRank without hyperlinks: Reranking with PubMed related article networks for biomedical text retrieval

Background: Graph analysis algorithms such as PageRank and HITS have been successful in Web environments because they are able to extract important inter-document relationships from manually created hyperlinks. We consider the application of these techniques to biomedical text retrieval. In the current PubMed® search interface, a MEDLINE® citation is connected to a number of related citations, which are in turn connected to other citations. Thus, a MEDLINE record represents a node in a vast content-similarity network. This article explores the hypothesis that these networks can be exploited for text retrieval, in the same manner as hyperlink graphs on the Web.

Results: We conducted a number of reranking experiments using the TREC 2005 genomics track test collection, in which scores extracted from PageRank and HITS analysis were combined with scores returned by an off-the-shelf retrieval engine. The experiments demonstrate that incorporating PageRank scores yields significant improvements in terms of standard ranked-retrieval metrics.

Conclusion: The link structure of content-similarity networks can be exploited to improve the effectiveness of information retrieval systems. These results generalize the applicability of graph analysis algorithms to text retrieval in the biomedical domain.
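The reranking idea can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the experimental setup of the paper: the abstract says only that link-analysis scores were "combined with" retrieval scores, so the linear interpolation, the min-max normalization, the `weight` parameter, the damping factor of 0.85, and the toy graph and scores below are all assumptions introduced for this example. The function names `pagerank`, `min_max`, and `rerank` are hypothetical as well.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a directed graph given as adjacency lists."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for source, targets in graph.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two score types are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (value - lo) / (hi - lo) for doc, value in scores.items()}


def rerank(retrieval_scores: Dict[str, float], graph: Dict[str, List[str]],
           weight: float = 0.2) -> List[str]:
    """Order documents by an interpolation of retrieval and PageRank scores.

    `weight` is the (assumed) share given to the link-based score.
    """
    pr = pagerank(graph)
    text = min_max(retrieval_scores)
    link = min_max({doc: pr.get(doc, 0.0) for doc in retrieval_scores})
    combined = {doc: (1.0 - weight) * text[doc] + weight * link[doc]
                for doc in retrieval_scores}
    return sorted(combined, key=combined.get, reverse=True)


# Toy related-article network: each key lists the citations it links to.
related = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
# Toy scores standing in for the output of a retrieval engine.
engine_scores = {"A": 2.0, "B": 3.0, "C": 2.5, "D": 1.0}

print(rerank(engine_scores, related, weight=0.2))
```

Setting `weight` to 0 reproduces the retrieval engine's original order, while larger values let well-connected citations in the similarity network move up the ranking. In practice the weight would be tuned on held-out topics.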
