A Short Survey on Online and Offline Methods for Search Quality Evaluation
[1] Gabriella Kazai,et al. User intent and assessor disagreement in web search evaluation , 2013, CIKM.
[2] M. de Rijke,et al. Click Models for Web Search , 2015.
[3] Filip Radlinski,et al. Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search , 2007, TOIS.
[4] J. Pearl. Comment: Understanding Simpson’s Paradox , 2013, Probabilistic and Causal Inference.
[5] Gabriella Kazai,et al. An analysis of systematic judging errors in information retrieval , 2012, CIKM.
[6] Filip Radlinski,et al. Large-scale validation and analysis of interleaved search evaluation , 2012, TOIS.
[7] Nick Craswell,et al. Random walks on the click graph , 2007, SIGIR.
[8] Javed A. Aslam,et al. IR system evaluation using nugget-based test collections , 2012, WSDM '12.
[9] M. de Rijke,et al. Click model-based information retrieval metrics , 2013, SIGIR.
[10] Nicholas J. Belkin. Salton Award Lecture: People, Interacting with Information , 2015, SIGIR.
[11] Ben Carterette,et al. Incorporating variability in user behavior into systems based evaluation , 2012, CIKM.
[12] Philipp Schaer,et al. Better than Their Reputation? On the Reliability of Relevance Assessments with Students , 2012, CLEF.
[13] Gabriella Kazai,et al. An analysis of human factors and label accuracy in crowdsourcing relevance judgments , 2013, Information Retrieval.
[14] Ron Kohavi,et al. Online controlled experiments at large scale , 2013, KDD.
[15] W. Bruce Croft,et al. Inferring query aspects from reformulations using clustering , 2011, CIKM '11.
[16] Ron Kohavi,et al. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data , 2013, WSDM.
[17] Tetsuya Sakai. Bootstrap-Based Comparisons of IR Metrics for Finding One Relevant Document , 2006, AIRS.
[18] Ricardo Baeza-Yates,et al. Design and Implementation of Relevance Assessments Using Crowdsourcing , 2011, ECIR.
[19] Alistair Moffat,et al. Rank-biased precision for measurement of retrieval effectiveness , 2008, TOIS.
[20] Matthew Lease,et al. Crowdsourcing for information retrieval , 2012, SIGIR Forum.
[21] Thorsten Joachims,et al. Evaluating Retrieval Performance Using Clickthrough Data , 2003, Text Mining.
[22] Alistair Moffat,et al. Click-based evidence for decaying weight distributions in search effectiveness metrics , 2010, Information Retrieval.
[23] Javed A. Aslam,et al. On the effectiveness of evaluating retrieval systems in the absence of relevance judgments , 2003, SIGIR.
[24] Paul N. Bennett,et al. Active query selection for learning rankers , 2012, SIGIR '12.
[25] Donna K. Harman,et al. Collaborative information seeking and retrieval , 2006 .
[26] Ben Carterette,et al. Robust test collections for retrieval evaluation , 2007, SIGIR.
[27] Benjamin Piwowarski,et al. A user browsing model to predict search engine click data from past observations , 2008, SIGIR '08.
[28] Mark Sanderson,et al. Do user preferences and evaluation measures line up? , 2010, SIGIR.
[29] Stefano Mizzaro,et al. Axiometrics: An Axiomatic Approach to Information Retrieval Effectiveness Metrics , 2013, ICTIR.
[30] Yue Gao,et al. Learning more powerful test statistics for click-based retrieval evaluation , 2010, SIGIR.
[31] Kalervo Järvelin,et al. Time drives interaction: simulating sessions in diverse searching environments , 2012, SIGIR '12.
[32] Emre Velipasaoglu,et al. Intent-based diversification of web search results: metrics and algorithms , 2011, Information Retrieval.
[33] Julio Gonzalo,et al. A general evaluation measure for document organization tasks , 2013, SIGIR.
[34] Ryen W. White,et al. No clicks, no problem: using cursor movements to understand and improve search , 2011, CHI.
[35] Nicholas J. Belkin,et al. Display time as implicit feedback: understanding task effects , 2004, SIGIR '04.
[36] Katja Hofmann,et al. Estimating interleaved comparison outcomes from historical click data , 2012, CIKM '12.
[37] M. de Rijke,et al. Probabilistic Multileave for Online Retrieval Evaluation , 2015, SIGIR.
[38] Ryen W. White,et al. Leaving so soon?: understanding and predicting web search abandonment rationales , 2012, CIKM.
[39] Ingemar J. Cox,et al. Selecting a Subset of Queries for Acquisition of Further Relevance Judgements , 2011, ICTIR.
[40] Javed A. Aslam,et al. A unified model for metasearch, pooling, and system evaluation , 2003, CIKM '03.
[41] Emine Yilmaz,et al. Estimating average precision with incomplete and imperfect judgments , 2006, CIKM '06.
[42] Eugene Agichtein,et al. Beyond dwell time: estimating document relevance from cursor movements and other post-click searcher behavior , 2012, WWW.
[43] James Allan,et al. A comparison of statistical significance tests for information retrieval evaluation , 2007, CIKM '07.
[44] Chao Liu,et al. Click chain model in web search , 2009, WWW '09.
[45] Pavel Metrikov,et al. Impact of assessor disagreement on ranking performance , 2012, SIGIR '12.
[46] ChengXiang Zhai,et al. Evaluation of methods for relative comparison of retrieval systems based on clickthroughs , 2009, CIKM.
[47] Ben Carterette,et al. The effect of assessor error on IR system evaluation , 2010, SIGIR.
[48] Nicola Ferro,et al. Injecting user models and time into precision via Markov chains , 2014, SIGIR.
[49] Omar Alonso,et al. Using crowdsourcing for TREC relevance assessment , 2012, Inf. Process. Manag..
[50] Qinghua Zheng,et al. Dynamic query intent mining from a search log stream , 2013, CIKM.
[51] Lihong Li,et al. Counterfactual Estimation and Optimization of Click Metrics in Search Engines: A Case Study , 2015, WWW.
[52] Yang Song,et al. Evaluating and predicting user engagement change with degraded search relevance , 2013, WWW.
[53] Ben Carterette,et al. Document features predicting assessor disagreement , 2013, SIGIR.
[54] Chao Liu,et al. Efficient multiple-click models in web search , 2009, WSDM '09.
[55] Ingemar J. Cox,et al. Topic (query) selection for IR evaluation , 2009, SIGIR.
[56] Benjamin Piwowarski,et al. Precision recall with user modeling (PRUM): Application to structured information retrieval , 2007, TOIS.
[57] Mike Thelwall,et al. Synthesis Lectures on Information Concepts, Retrieval, and Services , 2009 .
[58] Ben Carterette,et al. System effectiveness, user models, and user utility: a conceptual framework for investigation , 2011, SIGIR.
[59] Charles L. A. Clarke,et al. On the informativeness of cascade and intent-aware effectiveness measures , 2011, WWW.
[60] Miles Efron,et al. Using Multiple Query Aspects to Build Test Collections without Human Relevance Judgments , 2009, ECIR.
[61] Emine Yilmaz,et al. Representative & Informative Query Selection for Learning to Rank using Submodular Functions , 2015, SIGIR.
[62] Ron Kohavi,et al. Seven pitfalls to avoid when running controlled experiments on the web , 2009, KDD.
[63] Roi Blanco,et al. Repeatable and reliable search system evaluation using crowdsourcing , 2011, SIGIR.
[64] Nick Craswell,et al. An experimental comparison of click position-bias models , 2008, WSDM '08.
[65] Milad Shokouhi,et al. Expected browsing utility for web search evaluation , 2010, CIKM.
[66] Filip Radlinski,et al. Inferring query intent from reformulations and clicks , 2010, WWW '10.
[67] Ashish Agarwal,et al. Overlapping experiment infrastructure: more, better, faster experimentation , 2010, KDD.
[68] Katja Hofmann,et al. A probabilistic method for inferring preferences from clicks , 2011, CIKM '11.
[69] Lihong Li,et al. Counterfactual Estimation and Optimization of Click Metrics for Search Engines , 2014, ArXiv.
[70] Tetsuya Sakai,et al. Designing Test Collections for Comparing Many Systems , 2014, CIKM.
[71] Ron Kohavi,et al. Trustworthy online controlled experiments: five puzzling outcomes explained , 2012, KDD.
[72] Charles L. A. Clarke,et al. Time-based calibration of effectiveness measures , 2012, SIGIR '12.
[73] Peter Bailey,et al. Relevance assessment: are judges exchangeable and does it matter , 2008, SIGIR '08.
[74] Ben Carterette,et al. Evaluating multi-query sessions , 2011, SIGIR.
[75] Ben Carterette,et al. Million Query Track 2007 Overview , 2008, TREC.
[76] Charles L. A. Clarke,et al. Efficient construction of large test collections , 1998, SIGIR '98.
[77] Shuguang Han,et al. Contextual evaluation of query reformulations in a search session by user simulation , 2012, CIKM '12.
[78] Lois M. L. Delcambre,et al. Discounted Cumulated Gain Based Evaluation of Multiple-Query IR Sessions , 2008, ECIR.
[79] Filip Radlinski,et al. Relevance and Effort: An Analysis of Document Utility , 2014, CIKM.
[80] Stephen E. Robertson,et al. On the Contributions of Topics to System Evaluation , 2011, ECIR.
[81] Djoerd Hiemstra,et al. Exploiting user disagreement for web search evaluation: an experimental approach , 2014, WSDM.
[82] Nick Craswell,et al. Beyond clicks: query reformulation as a predictor of search satisfaction , 2013, CIKM.
[83] James Allan,et al. Evaluation over thousands of queries , 2008, SIGIR '08.
[84] Charles L. A. Clarke,et al. The impact of intent selection on diversified search evaluation , 2013, SIGIR.
[85] Shengli Wu,et al. Methods for ranking information retrieval systems without relevance judgments , 2003, SAC '03.
[86] Filip Radlinski,et al. Comparing the sensitivity of information retrieval metrics , 2010, SIGIR.
[87] Alex Deng,et al. Diluted Treatment Effect Estimation for Trigger Analysis in Online Controlled Experiments , 2015, WSDM.
[88] M. de Rijke,et al. Multileaved Comparisons for Fast Online Evaluation , 2014, CIKM.
[89] Mark Sanderson,et al. Test Collection Based Evaluation of Information Retrieval Systems , 2010, Found. Trends Inf. Retr..
[90] M. de Rijke,et al. Bayesian Ranker Comparison Based on Historical User Interactions , 2015, SIGIR.
[91] Eugene Agichtein,et al. Discovering common motifs in cursor movement data for improving web search , 2014, WSDM.
[92] Ben Carterette,et al. Statistical Significance Testing in Information Retrieval: Theory and Practice , 2014, SIGIR.
[93] Sreenivas Gollapudi,et al. Diversifying search results , 2009, WSDM '09.
[94] Ingemar J. Cox,et al. Optimizing the cost of information retrieval test collections , 2011, PIKM '11.
[95] Pavel Serdyukov,et al. On the Relation Between Assessor's Agreement and Accuracy in Gamified Relevance Assessment , 2015, SIGIR.
[96] Craig MacDonald,et al. Generalized Team Draft Interleaving , 2015, CIKM.
[97] Gleb Gusev,et al. Future User Engagement Prediction and Its Application to Improve the Sensitivity of Online Experiments , 2015, WWW.
[98] Dean Eckles,et al. Uncertainty in online experiments with dependent data: an evaluation of bootstrap methods , 2013, KDD.
[99] Emine Yilmaz,et al. The maximum entropy method for analyzing retrieval measures , 2005, SIGIR '05.
[100] James Allan,et al. Agreement among statistical significance tests for information retrieval evaluation at varying sample sizes , 2009, SIGIR.
[101] Thorsten Joachims,et al. Accurately Interpreting Clickthrough Data as Implicit Feedback , 2017 .
[102] Emine Yilmaz,et al. A statistical method for system evaluation using incomplete judgments , 2006, SIGIR.
[103] Ben Carterette,et al. Simulating simple user behavior for system effectiveness evaluation , 2011, CIKM '11.
[104] Floor Sietsma,et al. Evaluating intuitiveness of vertical-aware click models , 2014, SIGIR.
[105] Filip Radlinski,et al. Predicting Search Satisfaction Metrics with Interleaved Comparisons , 2015, SIGIR.
[106] Emine Yilmaz,et al. Inferring document relevance via average precision , 2006, SIGIR '06.
[107] Lihong Li,et al. Toward Predicting the Outcome of an A/B Experiment for Search Relevance , 2015, WSDM.
[108] Steve Fox,et al. Evaluating implicit measures to improve web search , 2005, TOIS.
[109] Ron Kohavi,et al. Seven rules of thumb for web site experimenters , 2014, KDD.
[110] Emine Yilmaz,et al. A simple and efficient sampling method for estimating AP and NDCG , 2008, SIGIR '08.
[111] Gabriella Kazai,et al. On judgments obtained from a commercial search engine , 2012, SIGIR '12.
[112] Filip Radlinski,et al. How does clickthrough data reflect retrieval quality? , 2008, CIKM '08.
[113] Olivier Chapelle,et al. Expected reciprocal rank for graded relevance , 2009, CIKM.
[114] James Allan,et al. If I Had a Million Queries , 2009, ECIR.
[115] Ben Carterette,et al. Alternative assessor disagreement and retrieval depth , 2012, CIKM '12.
[116] Gabriella Kazai,et al. In Search of Quality in Crowdsourcing for Search Engine Evaluation , 2011, ECIR.
[117] Ben Carterette,et al. Reusable test collections through experimental design , 2010, SIGIR.
[118] Yu Guo,et al. Flexible Online Repeated Measures Experiment , 2015 .
[119] M. de Rijke,et al. A Comparative Study of Click Models for Web Search , 2015, CLEF.
[120] Stephen E. Robertson,et al. On Using Fewer Topics in Information Retrieval Evaluations , 2013, ICTIR.
[121] Falk Scholer,et al. The Benefits of Magnitude Estimation Relevance Assessments for Information Retrieval Evaluation , 2015, SIGIR.
[122] Ben Carterette,et al. Dynamic Test Collections for Retrieval Evaluation , 2015, ICTIR.
[123] Stefano Mizzaro,et al. A Classification of IR Effectiveness Metrics , 2006, ECIR.
[124] James Allan,et al. Minimal test collections for retrieval evaluation , 2006, SIGIR.
[125] Stephen E. Robertson,et al. On per-topic variance in IR evaluation , 2012, SIGIR '12.
[126] Emine Yilmaz,et al. Effect of Intent Descriptions on Retrieval Evaluation , 2014, CIKM.
[127] Mark Sanderson,et al. Quantifying test collection quality based on the consistency of relevance judgements , 2011, SIGIR.
[128] Yiqun Liu,et al. Different Users, Different Opinions: Predicting Search Satisfaction with Mouse Movement Information , 2015, SIGIR.
[129] Charles L. A. Clarke,et al. Novelty and diversity in information retrieval evaluation , 2008, SIGIR '08.
[130] Yu Guo,et al. Statistical inference in two-stage online controlled experiments with treatment selection and validation , 2014, WWW.
[131] Qinghua Zheng,et al. Mining query subtopics from search log data , 2012, SIGIR '12.
[132] Ben Carterette,et al. Multiple testing in statistical analysis of systems-based information retrieval experiments , 2012, TOIS.
[133] Stephen E. Robertson,et al. A few good topics: Experiments in topic set reduction for retrieval evaluation , 2009, TOIS.
[134] Michael S. Bernstein,et al. Designing and deploying online field experiments , 2014, WWW.
[135] Ron Kohavi,et al. Controlled experiments on the web: survey and practical guide , 2009, Data Min. Knowl. Discov.
[136] Falk Scholer,et al. Judging Relevance Using Magnitude Estimation , 2015, ECIR.
[137] Filip Radlinski,et al. Optimized interleaving for online retrieval evaluation , 2013, WSDM.
[138] Ryen W. White,et al. Modeling dwell time to predict click-level satisfaction , 2014, WSDM.
[139] Ron Kohavi,et al. Online Controlled Experiments and A/B Tests , 2015.
[140] Craig MacDonald,et al. Sequential Testing for Early Stopping of Online Experiments , 2015, SIGIR.
[141] Emine Yilmaz,et al. Inferring document relevance from incomplete information , 2007, CIKM '07.
[142] Kalervo Järvelin,et al. Simulating Simple and Fallible Relevance Feedback , 2011, ECIR.
[143] Mark D. Smucker,et al. A qualitative exploration of secondary assessor relevance judging behavior , 2014, IIiX.
[144] Rabia Nuray-Turan,et al. Automatic ranking of information retrieval systems using data fusion , 2006, Inf. Process. Manag..
[145] Milad Shokouhi,et al. An uncertainty-aware query selection model for evaluation of IR systems , 2012, SIGIR '12.
[146] Ian Soboroff,et al. Ranking retrieval systems without relevance judgments , 2001, SIGIR '01.
[147] Evangelos Kanoulas,et al. Empirical justification of the gain and discount function for nDCG , 2009, CIKM.
[148] Milad Shokouhi,et al. On correlation of absence time and search effectiveness , 2014, SIGIR.
[149] Djoerd Hiemstra,et al. A Case for Automatic System Evaluation , 2010, ECIR.
[150] Tetsuya Sakai,et al. Evaluating diversified search results using per-intent graded relevance , 2011, SIGIR.