How Well do Offline and Online Evaluation Metrics Measure User Satisfaction in Web Image Search?

Compared with general Web search engines, image search engines present results differently, using a two-dimensional panel of images that users can scroll and browse quickly. These differences in result presentation can significantly change how users interact with search engines, and therefore affect existing methods of search evaluation. Although evaluation metrics have been thoroughly studied in the general Web search environment, how well offline and online metrics reflect user satisfaction in image search remains an open question. To shed light on this, we conduct a laboratory user study that collects both explicit user satisfaction feedback and user behavior signals such as clicks. We find that offline image search metrics correlate better with user satisfaction when they combine externally assessed topical relevance with image quality judgments than when they use topical relevance alone. We also demonstrate that existing offline Web search metrics can be adapted to evaluate the two-dimensional presentation used in image search. With respect to online metrics, we find that those based on image click information significantly outperform offline metrics. To our knowledge, our work is the first to thoroughly establish the relationship between different measures and user satisfaction in image search.
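
The abstract does not give the metric definitions, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a DCG-style offline metric computed over a two-dimensional result grid read in row-major order, a per-image gain that linearly mixes topical relevance with image quality, and Pearson correlation against per-query satisfaction ratings. The function names, the row-major reading order, and the `weight` parameter are all hypothetical choices made for illustration.

```python
import math


def combined_gain(relevance, quality, weight=0.5):
    """Hypothetical gain: linear mix of topical relevance and image quality."""
    return weight * relevance + (1 - weight) * quality


def grid_dcg(grid):
    """DCG over a 2D result panel, assuming users read it in row-major order.

    `grid` is a list of rows; each row is a list of per-image gains.
    """
    score = 0.0
    rank = 1
    for row in grid:
        for gain in row:
            # Standard logarithmic discount applied to the flattened position.
            score += gain / math.log2(rank + 1)
            rank += 1
    return score


def pearson(xs, ys):
    """Pearson correlation between metric scores and satisfaction ratings."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Under these assumptions, one would score each query's result grid with `grid_dcg`, once with relevance-only gains and once with `combined_gain`, and compare how strongly each set of scores correlates with the satisfaction ratings collected for the same queries.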
