[1] Maliha S. Nash,et al. Handbook of Parametric and Nonparametric Statistical Procedures , 2001, Technometrics.
[2] Pu Li,et al. Test theory for assessing IR test collections , 2007, SIGIR.
[3] Stephen E. Robertson,et al. A new rank correlation coefficient for information retrieval , 2008, SIGIR '08.
[4] Jean Tague-Sutcliffe. The pragmatics of information retrieval experimentation, revisited , 1992 .
[5] Ben Carterette,et al. Hypothesis testing with incomplete relevance judgments , 2007, CIKM '07.
[6] Donna K. Harman,et al. Overview of the First Text REtrieval Conference (TREC-1) , 1992, TREC.
[7] Mónica Marrero,et al. On the measurement of test collection reliability , 2013, SIGIR.
[8] Stefano Mizzaro,et al. Reproduce and Improve , 2018, ACM J. Data Inf. Qual..
[9] J. Shane Culpepper,et al. The effect of pooling and evaluation depth on IR metrics , 2016, Information Retrieval Journal.
[10] Ellen M. Voorhees,et al. The effect of topic set size on retrieval experiment error , 2002, SIGIR '02.
[11] Ben Carterette,et al. On rank correlation and the distance between rankings , 2009, SIGIR.
[12] G. Gescheider. Psychophysics: The Fundamentals , 1997 .
[13] Alistair Moffat,et al. Rank-biased precision for measurement of retrieval effectiveness , 2008, TOIS.
[14] Djoerd Hiemstra,et al. Relying on topic subsets for system ranking estimation , 2009, CIKM.
[15] Marco Basaldella,et al. Crowdsourcing Relevance Assessments: The Unexpected Benefits of Limiting the Time to Judge , 2016, HCOMP.
[16] Tetsuya Sakai,et al. Alternatives to Bpref , 2007, SIGIR.
[17] Kevin Roitero. CHEERS: CHeap & Engineered Evaluation of Retrieval Systems , 2018, SIGIR.
[18] Amanda Spink,et al. From Highly Relevant to Not Relevant: Examining Different Regions of Relevance , 1998, Inf. Process. Manag..
[19] Emine Yilmaz,et al. Representative & Informative Query Selection for Learning to Rank using Submodular Functions , 2015, SIGIR.
[20] Peter Bailey,et al. UQV100: A Test Collection with Query Variability , 2016, SIGIR.
[21] James E. Bartlett,et al. Organizational research: Determining appropriate sample size in survey research , 2001 .
[22] Lei Han,et al. All Those Wasted Hours: On Task Abandonment in Crowdsourcing , 2019, WSDM.
[23] Stefano Mizzaro,et al. Economic Evaluation of Recommender Systems: A Proposal , 2017, IIR.
[24] Donna K. Harman,et al. Overview of the Reliable Information Access Workshop , 2009, Information Retrieval.
[25] Alistair Moffat,et al. A similarity measure for indefinite rankings , 2010, TOIS.
[26] Matthew Lease,et al. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments , 2016, HCOMP.
[27] Eero Sormunen,et al. Liberal relevance criteria of TREC -: counting on negligible documents? , 2002, SIGIR '02.
[28] O. J. Dunn. Multiple Comparisons among Means , 1961 .
[29] Oren Kurland,et al. Query Performance Prediction Using Reference Lists , 2016, ACM Trans. Inf. Syst..
[30] Eddy Maddalena,et al. Crowd Worker Strategies in Relevance Judgment Tasks , 2020, WSDM.
[31] Ellen M. Voorhees,et al. TREC 2014 Web Track Overview , 2015, TREC.
[32] James Allan,et al. Comparing In Situ and Multidimensional Relevance Judgments , 2017, SIGIR.
[33] Jean Tague-Sutcliffe,et al. The Pragmatics of Information Retrieval Experimentation Revisited , 1997, Inf. Process. Manag..
[34] Stefano Mizzaro,et al. Improving the Efficiency of Retrieval Effectiveness Evaluation: Finding a Few Good Topics with Clustering? , 2016, IIR.
[35] Alistair Moffat,et al. Models and metrics: IR evaluation as a user process , 2012, ADCS.
[36] Anselm Spoerri,et al. Using the structure of overlap between search results to rank retrieval systems without relevance judgments , 2007, Inf. Process. Manag..
[37] Peter Ingwersen,et al. Dimensions of relevance , 2000, Inf. Process. Manag..
[38] Ellen M. Voorhees,et al. Evaluating Evaluation Measure Stability , 2000, SIGIR 2000.
[39] Hinrich Schütze,et al. Introduction to information retrieval , 2008 .
[40] Falk Scholer,et al. On Crowdsourcing Relevance Magnitudes for Information Retrieval Evaluation , 2017, ACM Trans. Inf. Syst..
[41] Ingemar J. Cox,et al. On Aggregating Labels from Multiple Crowd Workers to Infer Relevance of Documents , 2012, ECIR.
[42] Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval , 2021, J. Documentation.
[43] Eddy Maddalena,et al. Considering Assessor Agreement in IR Evaluation , 2017, ICTIR.
[44] Stefano Mizzaro,et al. Bias and Fairness in Effectiveness Evaluation by Means of Network Analysis and Mixture Models , 2019, IIR.
[45] Eddy Maddalena,et al. On Fine-Grained Relevance Scales , 2018, SIGIR.
[46] Mark Sanderson,et al. Problems with Kendall's tau , 2007, SIGIR.
[47] Allan Hanbury,et al. Assessors Agreement: A Case Study Across Assessor Type, Payment Levels, Query Variations and Relevance Dimensions , 2016, CLEF.
[48] David J. Sheskin,et al. Handbook of Parametric and Nonparametric Statistical Procedures , 1997 .
[49] Klaus Krippendorff,et al. Computing Krippendorff's Alpha-Reliability , 2011 .
[50] Ingemar J. Cox,et al. Selecting a Subset of Queries for Acquisition of Further Relevance Judgements , 2011, ICTIR.
[51] Stefano Mizzaro,et al. Towards Stochastic Simulations of Relevance Profiles , 2019, CIKM.
[52] Donna K. Harman,et al. Overview of the Eighth Text REtrieval Conference (TREC-8) , 1999, TREC.
[53] Eddy Maddalena,et al. IRevalOO: An Object Oriented Framework for Retrieval Evaluation , 2018, SIGIR.
[54] Stephen E. Robertson,et al. Hits hits TREC: exploring IR evaluation results with network analysis , 2007, SIGIR.
[55] Ingemar J. Cox,et al. Prioritizing relevance judgments to improve the construction of IR test collections , 2011, CIKM '11.
[56] Allan Hanbury,et al. The Impact of Fixed-Cost Pooling Strategies on Test Collection Bias , 2016, ICTIR.
[57] Gabriella Kazai. INitiative for the Evaluation of XML Retrieval , 2009, Encyclopedia of Database Systems.
[58] Rabia Nuray-Turan,et al. Automatic ranking of retrieval systems in imperfect environments , 2003, SIGIR '03.
[59] Tetsuya Sakai,et al. On the reliability of information retrieval metrics based on graded relevance , 2007, Inf. Process. Manag..
[60] Jaana Kekäläinen,et al. Cumulated gain-based evaluation of IR techniques , 2002, TOIS.
[61] Rong Tang,et al. Towards the Identification of the Optimal Number of Relevance Categories , 1999, J. Am. Soc. Inf. Sci..
[62] J. Aslam,et al. A Practical Sampling Strategy for Efficient Retrieval Evaluation , 2007 .
[63] Norbert Fuhr,et al. Some Common Mistakes In IR Evaluation, And How They Can Be Avoided , 2018, SIGIR Forum.
[64] Eddy Maddalena,et al. On Transforming Relevance Scales , 2019, CIKM.
[65] Tie-Yan Liu,et al. Learning to rank for information retrieval , 2009, SIGIR.
[66] Hayato Yamana,et al. Overview of the NTCIR-5 WEB Navigational Retrieval Subtask 2 (Navi-2) , 2005, NTCIR.
[67] Emine Yilmaz,et al. Estimating average precision with incomplete and imperfect judgments , 2006, CIKM '06.
[68] Pengfei Li,et al. On the Effectiveness of Query Weighting for Adapting Rank Learners to New Unlabelled Collections , 2016, CIKM.
[69] Neha Gupta,et al. Modus Operandi of Crowd Workers , 2017, Proc. ACM Interact. Mob. Wearable Ubiquitous Technol..
[70] Ahmed Abbasi,et al. Benchmarking Twitter Sentiment Analysis Tools , 2014, LREC.
[71] Fernando Diaz,et al. Vertical selection in the presence of unlabeled verticals , 2010, SIGIR '10.
[72] Stephen E. Robertson,et al. On GMAP: and other transformations , 2006, CIKM '06.
[73] C. J. van Rijsbergen,et al. Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW , 2001 .
[74] William Yang Wang. “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection , 2017, ACL.
[75] J. Shane Culpepper,et al. Fewer topics? A million topics? Both?! On topics subsets in test collections , 2020, Inf. Retr. J..
[76] Julián Urbano,et al. Stochastic Simulation of Test Collections: Evaluation Scores , 2018, SIGIR.
[77] Chris Buckley,et al. Topic prediction based on comparative retrieval rankings , 2004, SIGIR '04.
[78] J. Shane Culpepper,et al. On Topic Difficulty in IR Evaluation: The Effect of Systems, Corpora, and System Components , 2019, SIGIR.
[79] Ben Carterette,et al. Multiple testing in statistical analysis of systems-based information retrieval experiments , 2012, TOIS.
[80] Stephen E. Robertson,et al. A few good topics: Experiments in topic set reduction for retrieval evaluation , 2009, TOIS.
[81] Daniele Fanelli,et al. Negative results are disappearing from most disciplines and countries , 2011, Scientometrics.
[82] Elad Yom-Tov,et al. Estimating the query difficulty for information retrieval , 2010, Synthesis Lectures on Information Concepts, Retrieval, and Services.
[83] A. E. Hoerl,et al. Ridge regression: biased estimation for nonorthogonal problems , 2000 .
[84] Tetsuya Sakai,et al. Ranking Retrieval Systems without Relevance Assessments: Revisited , 2010, EVIA@NTCIR.
[85] Cyril Cleverdon,et al. The Cranfield tests on index language devices , 1997 .
[86] Eddy Maddalena,et al. The Impact of Task Abandonment in Crowdsourcing , 2019, IEEE Transactions on Knowledge and Data Engineering.
[87] R. Tibshirani. Regression Shrinkage and Selection via the Lasso , 1996 .
[88] Ben Carterette,et al. Preference based evaluation measures for novelty and diversity , 2013, SIGIR.
[89] Tamer Elsayed,et al. Intelligent topic selection for low-cost information retrieval evaluation: A New perspective on deep vs. shallow judging , 2017, Inf. Process. Manag..
[90] Javed A. Aslam,et al. On the effectiveness of evaluating retrieval systems in the absence of relevance judgments , 2003, SIGIR.
[91] Mark Sanderson,et al. Information retrieval system evaluation: effort, sensitivity, and reliability , 2005, SIGIR '05.
[92] Tefko Saracevic. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part III: Behavior and effects of relevance , 2007 .
[93] Guido Zuccon,et al. Overview of the CLEF 2018 Consumer Health Search Task , 2018, CLEF.
[94] Ellen M. Voorhees,et al. TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing) , 2005 .
[95] Allan Hanbury,et al. MM: A new Framework for Multidimensional Evaluation of Search Engines , 2018, CIKM.
[96] Ian H. Witten,et al. Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .
[97] Stefano Mizzaro,et al. How Many Truth Levels? Six? One Hundred? Even More? Validating Truthfulness of Statements via Crowdsourcing , 2018, CIKM Workshops.
[98] W. Bruce Croft,et al. Search Engines - Information Retrieval in Practice , 2009 .
[99] Leo Breiman,et al. Random Forests , 2001, Machine Learning.
[100] A. E. Eiben,et al. Introduction to Evolutionary Computing 2nd Edition , 2020 .
[101] Julián Urbano,et al. Test collection reliability: a study of bias and robustness to statistical assumptions via stochastic simulation , 2016, Information Retrieval Journal.
[102] Mounia Lalmas,et al. Report on the INEX 2003 workshop , 2004, SIGIR Forum.
[103] Alistair Moffat,et al. Statistical power in retrieval experimentation , 2008, CIKM '08.
[104] Tim Berners-Lee,et al. Information Management: A Proposal , 1990 .
[105] Oren Kurland,et al. Predicting Query Performance by Query-Drift Estimation , 2009, TOIS.
[106] Anand Rajaraman,et al. Mining of Massive Datasets , 2011 .
[107] Eddy Maddalena,et al. Let's Agree to Disagree: Fixing Agreement Measures for Crowdsourcing , 2017, HCOMP.
[108] Laurence A. Marschall,et al. Null and Void , 1999 .
[109] Jon Kleinberg,et al. Authoritative sources in a hyperlinked environment , 1999, SODA '98.
[110] Omar Alonso,et al. Using crowdsourcing for TREC relevance assessment , 2012, Inf. Process. Manag..
[111] Stefano Mizzaro,et al. Reproduce. Generalize. Extend. On Information Retrieval Evaluation without Relevance Judgments , 2018, ACM J. Data Inf. Qual..
[112] J. Knight. Negative results: Null and void , 2003, Nature.
[113] Chih-Jen Lin,et al. LIBSVM: A library for support vector machines , 2011, TIST.
[114] Stefano Mizzaro,et al. HITS Hits Readersourcing: Validating Peer Review Alternatives Using Network Analysis , 2019, BIRNDL@SIGIR.
[115] Guido Zuccon,et al. Understandability Biased Evaluation for Information Retrieval , 2016, ECIR.
[116] Josiane Mothe,et al. Human-Based Query Difficulty Prediction , 2017, ECIR.
[117] J. Shane Culpepper,et al. Improving test collection pools with machine learning , 2014, ADCS.
[118] Olivier Chapelle,et al. Expected reciprocal rank for graded relevance , 2009, CIKM.
[119] Stefano Mizzaro,et al. IR Evaluation without a Common Set of Topics , 2009, ICTIR.
[120] A. P. Dawid,et al. Maximum Likelihood Estimation of Observer Error‐Rates Using the EM Algorithm , 1979 .
[121] James Allan,et al. If I Had a Million Queries , 2009, ECIR.
[122] Stefano Mizzaro,et al. Effectiveness Evaluation with a Subset of Topics: A Practical Approach , 2018, SIGIR.
[123] Ben Carterette,et al. Million Query Track 2007 Overview , 2008, TREC.
[124] Anselm Spoerri,et al. How the overlap between the search results of different retrieval systems correlates with document relevance , 2006, ASIST.
[125] Gerard Salton,et al. The SMART Information Retrieval System after 30 years (Panel) , 1991, SIGIR 1991.
[126] Hsin-Hsi Chen,et al. Overview of CLIR Task at the Fourth NTCIR Workshop , 2004, NTCIR.
[127] José Luis Vicedo González,et al. TREC: Experiment and evaluation in information retrieval , 2007, J. Assoc. Inf. Sci. Technol..
[128] Emine Yilmaz,et al. Document selection methodologies for efficient and effective learning-to-rank , 2009, SIGIR.
[129] Ivor W. Tsang,et al. Domain Adaptation via Transfer Component Analysis , 2009, IEEE Transactions on Neural Networks.
[130] T. Saracevic,et al. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: nature and manifestations of relevance , 2007, J. Assoc. Inf. Sci. Technol..
[131] Rabia Nuray-Turan,et al. Automatic ranking of information retrieval systems using data fusion , 2006, Inf. Process. Manag..
[132] Falk Scholer,et al. Effective Pre-retrieval Query Performance Prediction Using Similarity and Variability Evidence , 2008, ECIR.
[133] Jong-Hak Lee,et al. Analyses of multiple evidence combination , 1997, SIGIR '97.
[134] Charles L. A. Clarke,et al. The TREC 2006 Terabyte Track , 2006, TREC.
[135] Maarten de Rijke,et al. Balancing Relevance Criteria through Multi-Objective Optimization , 2016, SIGIR.
[136] Shengli Wu,et al. Methods for ranking information retrieval systems without relevance judgments , 2003, SAC '03.
[137] David Zhang,et al. Learning Domain-Invariant Subspace Using Domain Features and Independence Maximization , 2016, IEEE Transactions on Cybernetics.
[138] David Maxwell Chickering,et al. Here or There , 2008, ECIR.
[139] Carsten Eickhoff,et al. Cognitive Biases in Crowdsourcing , 2018, WSDM.
[140] Peter Willett,et al. Document Retrieval Systems , 1988 .
[141] Handbook of Parametric and Nonparametric Statistical Procedures , 2004 .
[142] Stephen E. Robertson,et al. On Using Fewer Topics in Information Retrieval Evaluations , 2013, ICTIR.
[143] Falk Scholer,et al. The Benefits of Magnitude Estimation Relevance Assessments for Information Retrieval Evaluation , 2015, SIGIR.
[144] Noriko Kando,et al. Increasing Reproducibility in IR: Findings from the Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in e-Science" , 2016, SIGIR Forum.
[145] Daniel E. Rose,et al. Understanding user goals in web search , 2004, WWW '04.
[146] David E. Losada,et al. Multi-armed bandits for adjudicating documents in pooling-based evaluation of information retrieval systems , 2017, Inf. Process. Manag..
[147] Peter Bailey,et al. Tasks, Queries, and Rankers in Pre-Retrieval Performance Prediction , 2017, ADCS.
[148] Donna K. Harman,et al. The NRRC reliable information access (RIA) workshop , 2004, SIGIR '04.
[149] Ben Carterette,et al. Low-cost and robust evaluation of information retrieval systems , 2008, SIGIR Forum.
[150] Tetsuya Sakai,et al. Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015 , 2016, SIGIR.
[151] Ellen M. Voorhees,et al. Overview of the TREC 2004 Robust Retrieval Track , 2004 .
[152] Djoerd Hiemstra,et al. A survey of pre-retrieval query performance predictors , 2008, CIKM '08.
[153] Vannevar Bush,et al. As we may think , 1945, INTR.
[154] Tetsuya Sakai,et al. Designing Test Collections for Comparing Many Systems , 2014, CIKM.
[155] Stephen E. Robertson,et al. On the Contributions of Topics to System Evaluation , 2011, ECIR.
[156] Jakob Grue Simonsen,et al. Evaluation Measures for Relevance and Credibility in Ranked Lists , 2017, ICTIR.
[157] G. Casella,et al. The Bayesian Lasso , 2008 .
[158] F. Massey. The Kolmogorov-Smirnov Test for Goodness of Fit , 1951 .
[159] P. Fishburn. Condorcet Social Choice Functions , 1977 .
[160] Franciska de Jong,et al. Retrieval system evaluation: automatic evaluation versus incomplete judgments , 2010, SIGIR '10.
[161] Eddy Maddalena,et al. Do Easy Topics Predict Effectiveness Better Than Difficult Topics? , 2017, ECIR.
[162] Philip J. Corriveau,et al. Study of Rating Scales for Subjective Quality Assessment of High-Definition Video , 2011, IEEE Transactions on Broadcasting.
[163] Ellen M. Voorhees,et al. Overview of the TREC 2004 Robust Track , 2004 .
[164] Justin Zobel,et al. How reliable are the results of large-scale information retrieval experiments? , 1998, SIGIR '98.
[165] Falk Scholer,et al. The effect of threshold priming and need for cognition on relevance calibration and assessment , 2013, SIGIR.
[166] Shengli Wu,et al. Data fusion with estimated weights , 2002, CIKM '02.
[167] Josiane Mothe,et al. Linguistic features to predict query difficulty , 2005, SIGIR 2005.
[168] Josiane Mothe,et al. Query Performance Prediction and Effectiveness Evaluation Without Relevance Judgments: Two Sides of the Same Coin , 2018, SIGIR.
[169] Josiane Mothe,et al. Why do you Think this Query is Difficult?: A User Study on Human Query Prediction , 2016, SIGIR.
[170] James Allan,et al. Minimal test collections for retrieval evaluation , 2006, SIGIR.
[171] Tie-Yan Liu,et al. Learning to Rank for Information Retrieval , 2011 .
[172] Kalyanmoy Deb,et al. A fast and elitist multiobjective genetic algorithm: NSGA-II , 2002, IEEE Trans. Evol. Comput..
[173] Stefano Mizzaro,et al. How many relevances in information retrieval? , 1998, Interact. Comput..
[174] James Allan,et al. Evaluation over thousands of queries , 2008, SIGIR '08.
[175] Charles L. A. Clarke,et al. Overview of the TREC 2004 Terabyte Track , 2004, TREC.
[176] Stefano Mizzaro,et al. Effectiveness evaluation without human relevance judgments: A systematic analysis of existing methods and of their combinations , 2020, Inf. Process. Manag..
[177] R. Feise. Do multiple outcome measures require p-value adjustment? , 2002, BMC medical research methodology.
[178] Shariq Bashir. Combining pre-retrieval query quality predictors using genetic programming , 2013, Applied Intelligence.
[179] Oren Kurland,et al. Query-performance prediction: setting the expectations straight , 2014, SIGIR.
[180] Ron Kohavi,et al. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection , 1995, IJCAI.
[181] Mounia Lalmas,et al. Overview of INEX 2004 , 2004, INEX.
[182] Hans Peter Luhn,et al. A Statistical Approach to Mechanized Encoding and Searching of Literary Information , 1957, IBM J. Res. Dev..
[183] Fernando Diaz,et al. Performance prediction using spatial autocorrelation , 2007, SIGIR.
[184] Emine Yilmaz,et al. A simple and efficient sampling method for estimating AP and NDCG , 2008, SIGIR '08.
[185] Nicola Ferro,et al. Reproducibility Challenges in Information Retrieval Evaluation , 2017, ACM J. Data Inf. Qual..
[186] Milad Shokouhi,et al. An uncertainty-aware query selection model for evaluation of IR systems , 2012, SIGIR '12.
[187] Ian Soboroff,et al. Ranking retrieval systems without relevance judgments , 2001, SIGIR '01.
[188] and software — performance evaluation .
[189] Peter Emerson,et al. The original Borda count and partial voting , 2013, Soc. Choice Welf..
[190] Djoerd Hiemstra,et al. A Case for Automatic System Evaluation , 2010, ECIR.
[191] Tetsuya Sakai,et al. Topic set size design , 2015, Information Retrieval Journal.
[192] Milad Shokouhi,et al. Community-based bayesian aggregation models for crowdsourcing , 2014, WWW.
[193] João Francisco Valiati,et al. Document-level sentiment classification: An empirical comparison between SVM and ANN , 2013, Expert Syst. Appl..