How does clickthrough data reflect retrieval quality?

Automatically judging the quality of retrieval functions based on observable user behavior holds promise for making retrieval evaluation faster, cheaper, and more user-centered. However, the relationship between observable user behavior and retrieval quality is not yet fully understood. We present a sequence of studies investigating this relationship for an operational search engine on the arXiv.org e-print archive. We find that none of the eight absolute usage metrics we explore (e.g., number of clicks, frequency of query reformulations, abandonment) reliably reflects retrieval quality at the sample sizes we consider. In contrast, paired experiment designs adapted from sensory analysis produce accurate and reliable statements about the relative quality of two retrieval functions. In particular, we investigate two paired comparison tests that analyze clickthrough data from an interleaved presentation of ranking pairs, and find that both give consistent results that are substantially more accurate and sensitive than those of absolute usage metrics in our domain.
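
The paired comparison tests mentioned above rest on interleaving: results from the two retrieval functions are merged into one result list, and clicks are credited to the function that contributed each clicked document. The abstract does not spell out the exact algorithms, so the following is only a minimal illustrative sketch, assuming a team-draft-style interleaving and credit-assignment rule; the function names and details are assumptions for illustration, not the paper's implementation.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Interleave two rankings with a team-draft scheme.

    Returns the interleaved result list and a dict mapping each shown
    document to the ranking ('A' or 'B') whose "team" contributed it.
    """
    interleaved = []
    team_of = {}
    count_a = count_b = 0
    idx_a = idx_b = 0
    while True:
        # Skip over documents that have already been placed.
        while idx_a < len(ranking_a) and ranking_a[idx_a] in team_of:
            idx_a += 1
        while idx_b < len(ranking_b) and ranking_b[idx_b] in team_of:
            idx_b += 1
        a_has = idx_a < len(ranking_a)
        b_has = idx_b < len(ranking_b)
        if not a_has and not b_has:
            break
        # The team that has placed fewer documents picks next; ties are
        # broken by a coin flip to keep the presentation unbiased.
        pick_a = a_has and (not b_has or count_a < count_b or
                            (count_a == count_b and rng.random() < 0.5))
        if pick_a:
            doc, team = ranking_a[idx_a], 'A'
            count_a += 1
        else:
            doc, team = ranking_b[idx_b], 'B'
            count_b += 1
        team_of[doc] = team
        interleaved.append(doc)
    return interleaved, team_of


def score_impression(team_of, clicked_docs):
    """Credit clicks to the ranking that contributed each clicked document.

    Returns 'A', 'B', or 'tie' for one query impression; aggregating these
    outcomes over many impressions gives the paired-comparison statistic.
    """
    clicks_a = sum(1 for d in clicked_docs if team_of.get(d) == 'A')
    clicks_b = sum(1 for d in clicked_docs if team_of.get(d) == 'B')
    if clicks_a > clicks_b:
        return 'A'
    if clicks_b > clicks_a:
        return 'B'
    return 'tie'


if __name__ == "__main__":
    shown, teams = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"])
    print(shown)                                          # e.g. ['d1', 'd2', 'd4', 'd3']
    print(score_impression(teams, clicked_docs=["d4"]))   # -> 'B'
```

Because users see a single merged list, both retrieval functions are exposed to the same queries and the same presentation biases, which is what makes the resulting per-impression preferences a paired comparison rather than an absolute usage metric.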
