Revisiting the Evaluation of Diversified Search Evaluation Metrics with User Preferences

To validate the credibility of diversity evaluation metrics, a number of methods that “evaluate evaluation metrics” have been adopted in diversified search evaluation studies, such as Kendall’s τ, Discriminative Power, and the Intuitiveness Test. These methods have been widely used and have provided much insight into the effectiveness of evaluation metrics. However, they rely on particular assumptions about user behavior or statistical properties and do not take users’ actual search preferences into consideration. Using multi-grade user preference judgments collected for diversified search result lists displayed side by side, we take user preferences as the ground truth to investigate the evaluation of diversity metrics. We find that user preferences at the subtopic level yield results similar to those at the topic level, which means that topic-level user preferences, which require much less human effort, can be used in future experiments. We further find that most existing evaluation metrics correlate well with user preferences for result lists with large performance differences, regardless of whether the difference is detected by the metric or by the users. Based on these findings, we propose a preference-weighted correlation, the Multi-grade User Preference (MUP) method, to evaluate diversity metrics based on user preferences. The experimental results reveal that MUP evaluates diversity metrics from real users’ perspective, which may differ from that of other methods. In addition, we find that in our diversified search evaluation experiments, the relevance of search results matters more than their diversity.
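
The abstract describes MUP only at a high level, as a preference-weighted correlation between a metric's pairwise orderings of result lists and multi-grade user preferences; the exact formula is not given here. The sketch below illustrates one plausible form of such a preference-weighted agreement computation under stated assumptions: the function name `preference_weighted_agreement`, the grade-based weighting, and the toy data are illustrative, not the authors' definition of MUP.

```python
# A minimal sketch of a preference-weighted agreement score in the spirit of MUP.
# Assumption: each side-by-side judgment gives a signed grade whose sign says which
# list users preferred and whose magnitude encodes preference strength (multi-grade).

def preference_weighted_agreement(metric_scores, preference_pairs):
    """Compare a metric's pairwise orderings against graded user preferences.

    metric_scores    -- dict mapping run_id -> metric score for a topic
    preference_pairs -- iterable of (run_a, run_b, grade); grade > 0 means users
                        preferred run_a over run_b, grade < 0 the opposite
    """
    agree, total = 0.0, 0.0
    for run_a, run_b, grade in preference_pairs:
        weight = abs(grade)
        if weight == 0:
            continue  # users expressed no preference; skip the pair
        metric_diff = metric_scores[run_a] - metric_scores[run_b]
        # The metric agrees with users when it ranks the preferred list higher.
        if metric_diff * grade > 0:
            agree += weight
        total += weight
    return agree / total if total > 0 else 0.0


if __name__ == "__main__":
    # Toy example: diversity-metric scores for three runs on one topic (made up).
    scores = {"runA": 0.62, "runB": 0.55, "runC": 0.40}
    # Multi-grade side-by-side judgments: positive grade favours the first run.
    prefs = [("runA", "runB", 1), ("runA", "runC", 3), ("runB", "runC", 2)]
    print(preference_weighted_agreement(scores, prefs))  # -> 1.0 (full agreement)
```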
