Towards Nuanced System Evaluation Based on Implicit User Expectations

Information retrieval systems are often evaluated using effectiveness metrics. In the past, the metrics used have corresponded to fixed models of user behavior, presuming, for example, that the user views a pre-determined number of items in the search engine results page, or advances from one item in the result page to the next with a constant probability. Recently, a number of proposals for models of user behavior have emerged that are parameterized in terms of the number of relevant documents (or other material) a user expects will be required in order to address their information need. This recent work has demonstrated that T, the user's a priori utility expectation, is correlated with the underlying nature of the information need, and hence that evaluation metrics should be sensitive to T. Here we examine the relationship between the query the user issues and their anticipated T, seeking syntactic and other clues to guide the subsequent system evaluation. That is, we wish to develop mechanisms that, based on the query alone, can be used to adjust system evaluations so that the user's experience is better captured in the system's effectiveness score, and hence so that systems can be compared in a more refined way. This paper reports on a first round of experimentation, and describes the modest progress that we have achieved towards that goal.
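
To make the contrast concrete, here is a minimal Python sketch of the two fixed user models the abstract alludes to (inspection of a pre-determined number of items, as in precision at depth k, and a constant probability of continuing to the next item, as in rank-biased precision), alongside a purely illustrative T-dependent weighting. The function names, the specific T-based weighting scheme, and the example relevance vector are assumptions introduced for exposition only; they are not the formulation studied in the paper.

```python
# Sketches of the metric families contrasted in the abstract.
# The T-based weighting below is illustrative, not the paper's metric.

def precision_at_k(rels, k):
    """Fixed-depth model: the user inspects exactly the top k results."""
    return sum(rels[:k]) / k

def rbp(rels, p):
    """Rank-biased precision: constant probability p of advancing
    from one result to the next."""
    return (1 - p) * sum(r * p ** i for i, r in enumerate(rels))

def t_weighted_utility(rels, T):
    """Hypothetical T-sensitive weighting: weight decays more slowly
    when the user expects to need more relevant documents (larger T)."""
    weights = [(T / (i + T)) ** 2 for i in range(len(rels))]
    return sum(w * r for w, r in zip(weights, rels)) / sum(weights)

rels = [1, 0, 1, 1, 0, 0, 1]            # binary relevance of a ranked list
print(precision_at_k(rels, 5))           # 0.6
print(round(rbp(rels, p=0.8), 3))
print(round(t_weighted_utility(rels, T=3), 3))
```

The only point of the third function is that a larger T pushes weight deeper into the ranking, which is the kind of sensitivity to the user's a priori expectation that a T-aware evaluation metric would exhibit.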
