Evaluating Learning-to-Rank Methods in the Web Track Adhoc Task

Learning-to-rank methods are becoming ubiquitous in information retrieval. Their advantage lies in the ability to combine a large number of low-impact relevance signals, which requires large training and test data sets. A large test set is also needed to verify, with statistical methods, the usefulness of specific relevance signals. Several publicly available data collections are geared towards the evaluation of learning-to-rank methods. These collections are large, but they typically provide only a fixed set of precomputed (and often anonymized) relevance signals, so computing new signals may be impossible. This limitation motivated us to experiment with learning-to-rank methods on the TREC Web adhoc collection. Specifically, we compared the performance of learning-to-rank methods with that of a hand-tuned formula based on the same set of relevance signals. Even though the TREC data set did not have enough queries to draw definitive conclusions, the hand-tuned formula appeared to be on par with the learning-to-rank methods.
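
Below is a minimal sketch, not the authors' actual pipeline, that contrasts the two approaches on hypothetical relevance signals (BM25, term proximity, and PageRank scores): a hand-tuned linear formula with fixed weights versus a pairwise, RankSVM-style learner that fits weights from judged document pairs. All names, weights, and data here are illustrative assumptions.

# A minimal sketch (assumed signals and weights, not the paper's setup):
# hand-tuned linear formula vs. a pairwise RankSVM-style ranker.
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

# Toy data: one row of relevance signals per (query, document) pair.
# Columns: [bm25, proximity, pagerank]; y holds binary relevance judgments.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (2.0 * X[:, 0] + 0.5 * X[:, 1] + 0.3 * X[:, 2]
     + 0.1 * rng.standard_normal(200) > 1.4).astype(int)
query_ids = np.repeat(np.arange(20), 10)  # 20 queries, 10 documents each

def hand_tuned_score(signals, weights=(1.0, 0.3, 0.2)):
    """Hand-tuned formula: a fixed linear combination of the signals."""
    return signals @ np.asarray(weights)

def pairwise_training_data(X, y, query_ids):
    """Build difference vectors for document pairs within the same query."""
    diffs, labels = [], []
    for q in np.unique(query_ids):
        idx = np.where(query_ids == q)[0]
        for i, j in combinations(idx, 2):
            if y[i] == y[j]:
                continue  # skip pairs with identical relevance
            diffs.append(X[i] - X[j])
            labels.append(1 if y[i] > y[j] else -1)
    return np.array(diffs), np.array(labels)

# RankSVM-style learning to rank: a linear classifier on pairwise differences.
X_pairs, y_pairs = pairwise_training_data(X, y, query_ids)
ranker = LinearSVC(C=1.0).fit(X_pairs, y_pairs)

# Both approaches reduce to a linear scoring function over the same signals;
# the only difference is whether the weights are set by hand or learned.
learned_scores = X @ ranker.coef_.ravel()
manual_scores = hand_tuned_score(X)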
