PageRank without hyperlinks: structural re-ranking using links induced by language models

Inspired by the PageRank and HITS (hubs and authorities) algorithms for Web search, we propose a structural re-ranking approach to ad hoc information retrieval: we reorder the documents in an initially retrieved set by exploiting asymmetric relationships between them. Specifically, we consider generation links, which indicate that the language model induced from one document assigns high probability to the text of another; in doing so, we take care to prevent bias against long documents. We study a number of re-ranking criteria based on measures of centrality in the graphs formed by generation links, and show that integrating centrality into standard language-model-based retrieval is quite effective at improving precision at top ranks.
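The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes Dirichlet-smoothed unigram language models, a length-normalized (per-token average) log-likelihood as one simple way to avoid penalizing long texts, links from each document to its top-k "generators", and a standard damped PageRank iteration for centrality. The function names and the parameters `k`, `mu`, `damping`, and `iters` are hypothetical choices made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


def smoothed_lm(doc, corpus_counts, corpus_len, mu):
    """Dirichlet-smoothed unigram model induced from one document (assumed estimator)."""
    counts = Counter(doc)
    denom = len(doc) + mu
    return lambda w: (counts[w] + mu * corpus_counts[w] / corpus_len) / denom


def generation_graph(docs, k=1, mu=1.0):
    """Link each document o to the k documents whose models best 'generate' o's text."""
    corpus_counts = Counter(w for d in docs for w in d)
    corpus_len = sum(len(d) for d in docs)
    models = [smoothed_lm(d, corpus_counts, corpus_len, mu) for d in docs]
    edges = {}
    for o, text in enumerate(docs):
        scored = []
        for g, model in enumerate(models):
            if g == o:
                continue
            # Average log-probability per token: normalizing by len(text)
            # keeps long texts from being penalized (an assumption here).
            avg_logp = sum(math.log(model(w)) for w in text) / len(text)
            scored.append((avg_logp, g))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        edges[o] = {g: math.exp(s) for s, g in scored[:k]}
    return edges


def pagerank(edges, n, damping=0.85, iters=100):
    """Power iteration over the weighted graph; returns one centrality score per node."""
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for src in range(n):
            out = edges.get(src, {})
            total = sum(out.values())
            if total == 0:
                # Node without outgoing links: spread its mass uniformly.
                for dst in range(n):
                    new[dst] += damping * rank[src] / n
            else:
                for dst, weight in out.items():
                    new[dst] += damping * rank[src] * weight / total
        rank = new
    return rank


# Toy "initially retrieved set": three related documents and one outlier.
docs = [
    "language model retrieval ranking".split(),
    "language model retrieval".split(),
    "retrieval ranking language".split(),
    "cooking pasta recipe tomato".split(),
]
centrality = pagerank(generation_graph(docs, k=1), len(docs))
```

In this toy set, the first document covers the vocabulary of the other two related documents, so it receives their generation links and ends up with the highest centrality. A re-ranking step could then combine `centrality` with each document's initial retrieval score; how best to combine them is left open here.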
