A Short Survey on Online and Offline Methods for Search Quality Evaluation

Evaluation has always been the cornerstone of scientific development. Scientists come up with hypotheses (models) to explain physical phenomena, and validate these models by comparing their output to observations in nature. A scientific field then consists merely of a collection of hypotheses that have not (yet) been disproved when compared against nature. Evaluation plays the same central role in the field of information retrieval. Researchers and practitioners develop models to explain the relation between an information need expressed by a person and the information contained in available resources, and they test these models by comparing their outcomes to collections of observations.
