[1] Maliha S. Nash,et al. Handbook of Parametric and Nonparametric Statistical Procedures , 2001, Technometrics.
[2] Pu Li,et al. Test theory for assessing IR test collections , 2007, SIGIR.
[3] Stephen E. Robertson,et al. A new rank correlation coefficient for information retrieval , 2008, SIGIR '08.
[4] Jean Tague-Sutcliffe. The pragmatics of information retrieval experimentation, revisited , 1992 .
[5] Ben Carterette,et al. Hypothesis testing with incomplete relevance judgments , 2007, CIKM '07.
[6] Donna K. Harman,et al. Overview of the First Text REtrieval Conference (TREC-1) , 1992, TREC.
[7] Mónica Marrero,et al. On the measurement of test collection reliability , 2013, SIGIR.
[8] Stefano Mizzaro,et al. Reproduce and Improve , 2018, ACM J. Data Inf. Qual..
[9] J. Shane Culpepper,et al. The effect of pooling and evaluation depth on IR metrics , 2016, Information Retrieval Journal.
[10] Ellen M. Voorhees,et al. The effect of topic set size on retrieval experiment error , 2002, SIGIR '02.
[11] Ben Carterette,et al. On rank correlation and the distance between rankings , 2009, SIGIR.
[12] G. Gescheider. Psychophysics: The Fundamentals , 1997 .
[13] Alistair Moffat,et al. Rank-biased precision for measurement of retrieval effectiveness , 2008, TOIS.
[14] Djoerd Hiemstra,et al. Relying on topic subsets for system ranking estimation , 2009, CIKM.
[15] Marco Basaldella,et al. Crowdsourcing Relevance Assessments: The Unexpected Benefits of Limiting the Time to Judge , 2016, HCOMP.
[16] Tetsuya Sakai,et al. Alternatives to Bpref , 2007, SIGIR.
[17] Kevin Roitero. CHEERS: CHeap & Engineered Evaluation of Retrieval Systems , 2018, SIGIR.
[18] Amanda Spink,et al. From Highly Relevant to Not Relevant: Examining Different Regions of Relevance , 1998, Inf. Process. Manag..
[19] Emine Yilmaz,et al. Representative & Informative Query Selection for Learning to Rank using Submodular Functions , 2015, SIGIR.
[20] Peter Bailey,et al. UQV100: A Test Collection with Query Variability , 2016, SIGIR.
[21] James E. Bartlett,et al. Organizational research: Determining appropriate sample size in survey research , 2001 .
[22] Lei Han,et al. All Those Wasted Hours: On Task Abandonment in Crowdsourcing , 2019, WSDM.
[23] Stefano Mizzaro,et al. Economic Evaluation of Recommender Systems: A Proposal , 2017, IIR.
[24] Donna K. Harman,et al. Overview of the Reliable Information Access Workshop , 2009, Information Retrieval.
[25] Alistair Moffat,et al. A similarity measure for indefinite rankings , 2010, TOIS.
[26] Matthew Lease,et al. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments , 2016, HCOMP.
[27] Eero Sormunen,et al. Liberal relevance criteria of TREC -: counting on negligible documents? , 2002, SIGIR '02.
[28] O. J. Dunn. Multiple Comparisons among Means , 1961 .
[29] Oren Kurland,et al. Query Performance Prediction Using Reference Lists , 2016, ACM Trans. Inf. Syst..
[30] Eddy Maddalena,et al. Crowd Worker Strategies in Relevance Judgment Tasks , 2020, WSDM.
[31] Ellen M. Voorhees,et al. TREC 2014 Web Track Overview , 2015, TREC.
[32] James Allan,et al. Comparing In Situ and Multidimensional Relevance Judgments , 2017, SIGIR.
[33] Jean Tague-Sutcliffe,et al. The Pragmatics of Information Retrieval Experimentation Revisited , 1997, Inf. Process. Manag..
[34] Stefano Mizzaro,et al. Improving the Efficiency of Retrieval Effectiveness Evaluation: Finding a Few Good Topics with Clustering? , 2016, IIR.
[35] Alistair Moffat,et al. Models and metrics: IR evaluation as a user process , 2012, ADCS.
[36] Anselm Spoerri,et al. Using the structure of overlap between search results to rank retrieval systems without relevance judgments , 2007, Inf. Process. Manag..
[37] Peter Ingwersen,et al. Dimensions of relevance , 2000, Inf. Process. Manag..
[38] Ellen M. Voorhees,et al. Evaluating Evaluation Measure Stability , 2000, SIGIR 2000.
[39] Hinrich Schütze,et al. Introduction to information retrieval , 2008 .
[40] Falk Scholer,et al. On Crowdsourcing Relevance Magnitudes for Information Retrieval Evaluation , 2017, ACM Trans. Inf. Syst..
[41] Ingemar J. Cox,et al. On Aggregating Labels from Multiple Crowd Workers to Infer Relevance of Documents , 2012, ECIR.
[42] Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval , 2021, J. Documentation.
[43] Eddy Maddalena,et al. Considering Assessor Agreement in IR Evaluation , 2017, ICTIR.
[44] Stefano Mizzaro,et al. Bias and Fairness in Effectiveness Evaluation by Means of Network Analysis and Mixture Models , 2019, IIR.
[45] Eddy Maddalena,et al. On Fine-Grained Relevance Scales , 2018, SIGIR.
[46] Mark Sanderson,et al. Problems with Kendall's tau , 2007, SIGIR.
[47] Allan Hanbury,et al. Assessors Agreement: A Case Study Across Assessor Type, Payment Levels, Query Variations and Relevance Dimensions , 2016, CLEF.
[48] David J. Sheskin,et al. Handbook of Parametric and Nonparametric Statistical Procedures , 1997 .
[49] Klaus Krippendorff,et al. Computing Krippendorff's Alpha-Reliability , 2011 .
[50] Ingemar J. Cox,et al. Selecting a Subset of Queries for Acquisition of Further Relevance Judgements , 2011, ICTIR.
[51] Stefano Mizzaro,et al. Towards Stochastic Simulations of Relevance Profiles , 2019, CIKM.
[52] Donna K. Harman,et al. Overview of the Eighth Text REtrieval Conference (TREC-8) , 1999, TREC.
[53] Eddy Maddalena,et al. IRevalOO: An Object Oriented Framework for Retrieval Evaluation , 2018, SIGIR.
[54] Stephen E. Robertson,et al. Hits hits TREC: exploring IR evaluation results with network analysis , 2007, SIGIR.
[55] Ingemar J. Cox,et al. Prioritizing relevance judgments to improve the construction of IR test collections , 2011, CIKM '11.
[56] Allan Hanbury,et al. The Impact of Fixed-Cost Pooling Strategies on Test Collection Bias , 2016, ICTIR.
[57] Gabriella Kazai. INitiative for the Evaluation of XML Retrieval , 2009, Encyclopedia of Database Systems.
[58] Rabia Nuray-Turan,et al. Automatic ranking of retrieval systems in imperfect environments , 2003, SIGIR '03.
[59] Tetsuya Sakai,et al. On the reliability of information retrieval metrics based on graded relevance , 2007, Inf. Process. Manag..
[60] Jaana Kekäläinen,et al. Cumulated gain-based evaluation of IR techniques , 2002, TOIS.
[61] Rong Tang,et al. Towards the Identification of the Optimal Number of Relevance Categories , 1999, J. Am. Soc. Inf. Sci..
[62] J. Aslam,et al. A Practical Sampling Strategy for Efficient Retrieval Evaluation , 2007 .
[63] Norbert Fuhr,et al. Some Common Mistakes In IR Evaluation, And How They Can Be Avoided , 2018, SIGIR Forum.
[64] Eddy Maddalena,et al. On Transforming Relevance Scales , 2019, CIKM.
[65] Tie-Yan Liu,et al. Learning to rank for information retrieval , 2009, SIGIR.
[66] Hayato Yamana,et al. Overview of the NTCIR-5 WEB Navigational Retrieval Subtask 2 (Navi-2) , 2005, NTCIR.
[67] Emine Yilmaz,et al. Estimating average precision with incomplete and imperfect judgments , 2006, CIKM '06.
[68] Pengfei Li,et al. On the Effectiveness of Query Weighting for Adapting Rank Learners to New Unlabelled Collections , 2016, CIKM.
[69] Neha Gupta,et al. Modus Operandi of Crowd Workers , 2017, Proc. ACM Interact. Mob. Wearable Ubiquitous Technol..
[70] Ahmed Abbasi,et al. Benchmarking Twitter Sentiment Analysis Tools , 2014, LREC.
[71] Fernando Diaz,et al. Vertical selection in the presence of unlabeled verticals , 2010, SIGIR '10.
[72] Stephen E. Robertson,et al. On GMAP: and other transformations , 2006, CIKM '06.
[73] C. J. van Rijsbergen,et al. Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW , 2001 .
[74] William Yang Wang. “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection , 2017, ACL.
[75] J. Shane Culpepper,et al. Fewer topics? A million topics? Both?! On topics subsets in test collections , 2020, Inf. Retr. J..
[76] Julián Urbano,et al. Stochastic Simulation of Test Collections: Evaluation Scores , 2018, SIGIR.
[77] Chris Buckley,et al. Topic prediction based on comparative retrieval rankings , 2004, SIGIR '04.
[78] J. Shane Culpepper,et al. On Topic Difficulty in IR Evaluation: The Effect of Systems, Corpora, and System Components , 2019, SIGIR.
[79] Ben Carterette,et al. Multiple testing in statistical analysis of systems-based information retrieval experiments , 2012, TOIS.
[80] Stephen E. Robertson,et al. A few good topics: Experiments in topic set reduction for retrieval evaluation , 2009, TOIS.
[81] Daniele Fanelli,et al. Negative results are disappearing from most disciplines and countries , 2011, Scientometrics.
[82] Elad Yom-Tov,et al. Estimating the query difficulty for information retrieval , 2010, Synthesis Lectures on Information Concepts, Retrieval, and Services.
[83] A. E. Hoerl,et al. Ridge regression: biased estimation for nonorthogonal problems , 2000 .
[84] Tetsuya Sakai,et al. Ranking Retrieval Systems without Relevance Assessments: Revisited , 2010, EVIA@NTCIR.
[85] Cyril Cleverdon,et al. The Cranfield tests on index language devices , 1997 .
[86] Eddy Maddalena,et al. The Impact of Task Abandonment in Crowdsourcing , 2019, IEEE Transactions on Knowledge and Data Engineering.
[87] R. Tibshirani. Regression Shrinkage and Selection via the Lasso , 1996 .
[88] Ben Carterette,et al. Preference based evaluation measures for novelty and diversity , 2013, SIGIR.
[89] Tamer Elsayed,et al. Intelligent topic selection for low-cost information retrieval evaluation: A New perspective on deep vs. shallow judging , 2017, Inf. Process. Manag..
[90] Javed A. Aslam,et al. On the effectiveness of evaluating retrieval systems in the absence of relevance judgments , 2003, SIGIR.
[91] Mark Sanderson,et al. Information retrieval system evaluation: effort, sensitivity, and reliability , 2005, SIGIR '05.
[92] Tefko Saracevic. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance , 2007 .
[93] Guido Zuccon,et al. Overview of the CLEF 2018 Consumer Health Search Task , 2018, CLEF.
[94] Ellen M. Voorhees,et al. TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing) , 2005 .
[95] Allan Hanbury,et al. MM: A new Framework for Multidimensional Evaluation of Search Engines , 2018, CIKM.
[96] Ian H. Witten,et al. Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .
[97] Stefano Mizzaro,et al. How Many Truth Levels? Six? One Hundred? Even More? Validating Truthfulness of Statements via Crowdsourcing , 2018, CIKM Workshops.
[98] W. Bruce Croft,et al. Search Engines - Information Retrieval in Practice , 2009 .
[99] Leo Breiman,et al. Random Forests , 2001, Machine Learning.
[100] A. E. Eiben,et al. Introduction to Evolutionary Computing 2nd Edition , 2020 .
[101] Julián Urbano,et al. Test collection reliability: a study of bias and robustness to statistical assumptions via stochastic simulation , 2016, Information Retrieval Journal.
[102] Mounia Lalmas,et al. Report on the INEX 2003 workshop , 2004, SIGIR Forum.
[103] Alistair Moffat,et al. Statistical power in retrieval experimentation , 2008, CIKM '08.
[104] Tim Berners-Lee,et al. Information Management: A Proposal , 1990 .
[105] Oren Kurland,et al. Predicting Query Performance by Query-Drift Estimation , 2009, TOIS.
[106] Anand Rajaraman,et al. Mining of Massive Datasets , 2011 .
[107] Eddy Maddalena,et al. Let's Agree to Disagree: Fixing Agreement Measures for Crowdsourcing , 2017, HCOMP.
[108] Laurence A. Marschall,et al. Null and Void , 1999 .
[109] Jon Kleinberg,et al. Authoritative sources in a hyperlinked environment , 1999, SODA '98.
[110] Omar Alonso,et al. Using crowdsourcing for TREC relevance assessment , 2012, Inf. Process. Manag..
[111] Stefano Mizzaro,et al. Reproduce. Generalize. Extend. On Information Retrieval Evaluation without Relevance Judgments , 2018, ACM J. Data Inf. Qual..
[112] J. Knight. Negative results: Null and void , 2003, Nature.
[113] Chih-Jen Lin,et al. LIBSVM: A library for support vector machines , 2011, TIST.
[114] Stefano Mizzaro,et al. HITS Hits Readersourcing: Validating Peer Review Alternatives Using Network Analysis , 2019, BIRNDL@SIGIR.
[115] Guido Zuccon,et al. Understandability Biased Evaluation for Information Retrieval , 2016, ECIR.
[116] Josiane Mothe,et al. Human-Based Query Difficulty Prediction , 2017, ECIR.
[117] J. Shane Culpepper,et al. Improving test collection pools with machine learning , 2014, ADCS.
[118] Olivier Chapelle,et al. Expected reciprocal rank for graded relevance , 2009, CIKM.
[119] Stefano Mizzaro,et al. IR Evaluation without a Common Set of Topics , 2009, ICTIR.
[120] A. P. Dawid,et al. Maximum Likelihood Estimation of Observer Error‐Rates Using the EM Algorithm , 1979 .
[121] James Allan,et al. If I Had a Million Queries , 2009, ECIR.
[122] Stefano Mizzaro,et al. Effectiveness Evaluation with a Subset of Topics: A Practical Approach , 2018, SIGIR.
[123] Ben Carterette,et al. Million Query Track 2007 Overview , 2008, TREC.
[124] Anselm Spoerri,et al. How the overlap between the search results of different retrieval systems correlates with document relevance , 2006, ASIST.
[125] Gerard Salton,et al. The SMART Information Retrieval System after 30 years (Panel) , 1991, SIGIR 1991.
[126] Hsin-Hsi Chen,et al. Overview of CLIR Task at the Fourth NTCIR Workshop , 2004, NTCIR.
[127] José Luis Vicedo González,et al. TREC: Experiment and evaluation in information retrieval , 2007, J. Assoc. Inf. Sci. Technol..
[128] Emine Yilmaz,et al. Document selection methodologies for efficient and effective learning-to-rank , 2009, SIGIR.
[129] Ivor W. Tsang,et al. Domain Adaptation via Transfer Component Analysis , 2009, IEEE Transactions on Neural Networks.
[130] T. Saracevic,et al. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: nature and manifestations of relevance , 2007, J. Assoc. Inf. Sci. Technol..
[131] Rabia Nuray-Turan,et al. Automatic ranking of information retrieval systems using data fusion , 2006, Inf. Process. Manag..
[132] Falk Scholer,et al. Effective Pre-retrieval Query Performance Prediction Using Similarity and Variability Evidence , 2008, ECIR.
[133] Jong-Hak Lee,et al. Analyses of multiple evidence combination , 1997, SIGIR '97.
[134] Charles L. A. Clarke,et al. The TREC 2006 Terabyte Track , 2006, TREC.
[135] Maarten de Rijke,et al. Balancing Relevance Criteria through Multi-Objective Optimization , 2016, SIGIR.
[136] Shengli Wu,et al. Methods for ranking information retrieval systems without relevance judgments , 2003, SAC '03.
[137] David Zhang,et al. Learning Domain-Invariant Subspace Using Domain Features and Independence Maximization , 2016, IEEE Transactions on Cybernetics.
[138] David Maxwell Chickering,et al. Here or There , 2008, ECIR.
[139] Carsten Eickhoff,et al. Cognitive Biases in Crowdsourcing , 2018, WSDM.
[140] Peter Willett,et al. Document Retrieval Systems , 1988 .
[141] Handbook of Parametric and Nonparametric Statistical Procedures , 2004 .
[142] Stephen E. Robertson,et al. On Using Fewer Topics in Information Retrieval Evaluations , 2013, ICTIR.
[143] Falk Scholer,et al. The Benefits of Magnitude Estimation Relevance Assessments for Information Retrieval Evaluation , 2015, SIGIR.
[144] Noriko Kando,et al. Increasing Reproducibility in IR: Findings from the Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in e-Science" , 2016, SIGIR Forum.
[145] Daniel E. Rose,et al. Understanding user goals in web search , 2004, WWW '04.
[146] David E. Losada,et al. Multi-armed bandits for adjudicating documents in pooling-based evaluation of information retrieval systems , 2017, Inf. Process. Manag..
[147] Peter Bailey,et al. Tasks, Queries, and Rankers in Pre-Retrieval Performance Prediction , 2017, ADCS.
[148] Donna K. Harman,et al. The NRRC reliable information access (RIA) workshop , 2004, SIGIR '04.
[149] Ben Carterette,et al. Low-cost and robust evaluation of information retrieval systems , 2008, SIGIR Forum.
[150] Tetsuya Sakai,et al. Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015 , 2016, SIGIR.
[151] Ellen M. Voorhees,et al. Overview of the TREC 2004 Robust Retrieval Track , 2004 .
[152] Djoerd Hiemstra,et al. A survey of pre-retrieval query performance predictors , 2008, CIKM '08.
[153] Vannevar Bush,et al. As we may think , 1945, INTR.
[154] Tetsuya Sakai,et al. Designing Test Collections for Comparing Many Systems , 2014, CIKM.
[155] Stephen E. Robertson,et al. On the Contributions of Topics to System Evaluation , 2011, ECIR.
[156] Jakob Grue Simonsen,et al. Evaluation Measures for Relevance and Credibility in Ranked Lists , 2017, ICTIR.
[157] G. Casella,et al. The Bayesian Lasso , 2008 .
[158] F. Massey. The Kolmogorov-Smirnov Test for Goodness of Fit , 1951 .
[159] P. Fishburn. Condorcet Social Choice Functions , 1977 .
[160] Franciska de Jong,et al. Retrieval system evaluation: automatic evaluation versus incomplete judgments , 2010, SIGIR '10.
[161] Eddy Maddalena,et al. Do Easy Topics Predict Effectiveness Better Than Difficult Topics? , 2017, ECIR.
[162] Philip J. Corriveau,et al. Study of Rating Scales for Subjective Quality Assessment of High-Definition Video , 2011, IEEE Transactions on Broadcasting.
[163] Ellen M. Voorhees,et al. Overview of the TREC 2004 Robust Track , 2004 .
[164] Justin Zobel,et al. How reliable are the results of large-scale information retrieval experiments? , 1998, SIGIR '98.
[165] Falk Scholer,et al. The effect of threshold priming and need for cognition on relevance calibration and assessment , 2013, SIGIR.
[166] Shengli Wu,et al. Data fusion with estimated weights , 2002, CIKM '02.
[167] Josiane Mothe,et al. Linguistic features to predict query difficulty , 2005, SIGIR 2005.
[168] Josiane Mothe,et al. Query Performance Prediction and Effectiveness Evaluation Without Relevance Judgments: Two Sides of the Same Coin , 2018, SIGIR.
[169] Josiane Mothe,et al. Why do you Think this Query is Difficult?: A User Study on Human Query Prediction , 2016, SIGIR.
[170] James Allan,et al. Minimal test collections for retrieval evaluation , 2006, SIGIR.
[171] Tie-Yan Liu,et al. Learning to Rank for Information Retrieval , 2011 .
[172] Kalyanmoy Deb,et al. A fast and elitist multiobjective genetic algorithm: NSGA-II , 2002, IEEE Trans. Evol. Comput..
[173] Stefano Mizzaro,et al. How many relevances in information retrieval? , 1998, Interact. Comput..
[174] James Allan,et al. Evaluation over thousands of queries , 2008, SIGIR '08.
[175] Charles L. A. Clarke,et al. Overview of the TREC 2004 Terabyte Track , 2004, TREC.
[176] Stefano Mizzaro,et al. Effectiveness evaluation without human relevance judgments: A systematic analysis of existing methods and of their combinations , 2020, Inf. Process. Manag..
[177] R. Feise. Do multiple outcome measures require p-value adjustment? , 2002, BMC medical research methodology.
[178] Shariq Bashir. Combining pre-retrieval query quality predictors using genetic programming , 2013, Applied Intelligence.
[179] Oren Kurland,et al. Query-performance prediction: setting the expectations straight , 2014, SIGIR.
[180] Ron Kohavi,et al. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection , 1995, IJCAI.
[181] Mounia Lalmas,et al. Overview of INEX 2004 , 2004, INEX.
[182] Hans Peter Luhn,et al. A Statistical Approach to Mechanized Encoding and Searching of Literary Information , 1957, IBM J. Res. Dev..
[183] Fernando Diaz,et al. Performance prediction using spatial autocorrelation , 2007, SIGIR.
[184] Emine Yilmaz,et al. A simple and efficient sampling method for estimating AP and NDCG , 2008, SIGIR '08.
[185] Nicola Ferro,et al. Reproducibility Challenges in Information Retrieval Evaluation , 2017, ACM J. Data Inf. Qual..
[186] Milad Shokouhi,et al. An uncertainty-aware query selection model for evaluation of IR systems , 2012, SIGIR '12.
[187] Ian Soboroff,et al. Ranking retrieval systems without relevance judgments , 2001, SIGIR '01.
[188] and software — performance evaluation .
[189] Peter Emerson,et al. The original Borda count and partial voting , 2013, Soc. Choice Welf..
[190] Djoerd Hiemstra,et al. A Case for Automatic System Evaluation , 2010, ECIR.
[191] Tetsuya Sakai,et al. Topic set size design , 2015, Information Retrieval Journal.
[192] Milad Shokouhi,et al. Community-based bayesian aggregation models for crowdsourcing , 2014, WWW.
[193] João Francisco Valiati,et al. Document-level sentiment classification: An empirical comparison between SVM and ANN , 2013, Expert Syst. Appl..