Preference-based Evaluation Metrics for Web Image Search

Following the success of Cranfield-like evaluation approaches in web search, web image search has also been evaluated with absolute judgments of (graded) relevance. However, recent research has found that collecting absolute relevance judgments may be difficult in image search scenarios because relevance for image results is multi-dimensional. Moreover, existing evaluation metrics based on absolute relevance judgments do not correlate well with users' perceived satisfaction in web image search. Unlike absolute relevance judgments, preference judgments do not require relevance grades to be pre-defined, i.e., how many levels to use and what those levels mean. Instead of considering each document in isolation, preference judgments consider a pair of documents and require judges to state their relative preference. Such preference judgments are usually more reliable than absolute judgments, since the presence of (at least) two items establishes a certain context. While preference judgments have been studied extensively for general web search, there has been no thorough investigation of how preference judgments and preference-based evaluation metrics can be used to evaluate web image search systems. Compared to general web search, web image search may be an even better fit for preference-based evaluation because of its grid-based presentation style. The limited need for fresh results in web image search also makes preference judgments more reusable than in general web search. In this paper, we provide a thorough comparison of variants of preference judgments for web image search. We find that, compared to strict preference judgments, weak preference judgments require less time and have better inter-assessor agreement. We also study how the absolute relevance levels of two given images affect the preference judgments between them.
Furthermore, we propose a preference-based evaluation metric named Preference-Winning-Penalty (PWP) to evaluate and compare two image search systems. The proposed PWP metric outperforms existing evaluation metrics based on absolute relevance judgments in terms of agreement with the system-level preferences of actual users.
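
To make the notion of inter-assessor agreement on preference judgments concrete, the following is a minimal sketch that computes Fleiss' kappa over pairwise labels. It rests on assumptions not stated in the abstract: that agreement is measured with Fleiss' kappa, that weak preference judgments use the labels "left", "tie", and "right", and that every image pair is judged by the same number of assessors. The function name and the toy data are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of assessors.

    ratings: list of lists; ratings[i] holds the labels assigned to item i.
    categories: all labels that assessors were allowed to use.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(r) for r in ratings]

    # Observed agreement: for each item, the fraction of assessor pairs that agree.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement by chance, from the overall label proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in proportions)

    if p_expected == 1.0:
        # Only one label was ever used; kappa is undefined, so this sketch
        # returns 1.0 as a convention (an assumption, not a standard rule).
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical weak preference judgments: three assessors per image pair.
weak_labels = ("left", "tie", "right")
weak_judgments = [
    ["left", "left", "left"],
    ["tie", "tie", "right"],
    ["right", "right", "right"],
    ["left", "tie", "left"],
]
print(fleiss_kappa(weak_judgments, weak_labels))
```

The same function can be applied to strict preference judgments by passing only the two labels "left" and "right", which allows the two judgment variants to be compared on the same image pairs.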
