Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors
[1] James Allan,et al. A comparison of statistical significance tests for information retrieval evaluation , 2007, CIKM '07.
[2] E. Pitman. Significance Tests Which May be Applied to Samples from Any Populations , 1937 .
[3] James Allan,et al. Evaluation over thousands of queries , 2008, SIGIR '08.
[4] Jacques Savoy,et al. Statistical inference in retrieval effectiveness evaluation , 1997, Inf. Process. Manag..
[5] M. Kenward,et al. An Introduction to the Bootstrap , 2007 .
[6] Mark Sanderson,et al. Size and Source Matter: Understanding Inconsistencies in Test Collection-Based Evaluation , 2014, CIKM.
[7] W. J. Conover,et al. Practical Nonparametric Statistics , 1972 .
[8] Ben Carterette. Bayesian Inference for Information Retrieval Evaluation , 2015, ICTIR.
[9] Mónica Marrero,et al. On the measurement of test collection reliability , 2013, SIGIR.
[10] James Allan,et al. Agreement among statistical significance tests for information retrieval evaluation at varying sample sizes , 2009, SIGIR.
[11] Gordon V. Cormack,et al. Validity and power of t-test for comparing MAP and GMAP , 2007, SIGIR.
[12] Tetsuya Sakai,et al. Topic set size design , 2015, Information Retrieval Journal.
[13] Davis B. McCarn,et al. A mathematical model of retrieval system performance , 1990, J. Am. Soc. Inf. Sci..
[14] Mark Sanderson,et al. Information retrieval system evaluation: effort, sensitivity, and reliability , 2005, SIGIR '05.
[15] Tetsuya Sakai,et al. Statistical reform in information retrieval? , 2014, SIGIR Forum.
[16] Gordon V. Cormack,et al. Statistical precision of information retrieval evaluation , 2006, SIGIR.
[17] Omar Alonso,et al. Using crowdsourcing for TREC relevance assessment , 2012, Inf. Process. Manag..
[18] J. Pratt. Remarks on Zeros and Ties in the Wilcoxon Signed Rank Procedures , 1959 .
[19] H. Joe. Dependence Modeling with Copulas , 2014 .
[20] W. J. Conover,et al. A Note on the Small-Sample Power Functions for Nonparametric Tests of Location in the Double Exponential Family , 1978 .
[21] Ying Zhang,et al. Differences in effectiveness across sub-collections , 2012, CIKM.
[22] Peter Bailey,et al. Relevance assessment: are judges exchangeable and does it matter , 2008, SIGIR '08.
[23] J. I. The Design of Experiments , 1936, Nature.
[24] Tetsuya Sakai. Two Sample T-tests for IR Evaluation: Student or Welch? , 2016, SIGIR.
[25] Justin Zobel,et al. How reliable are the results of large-scale information retrieval experiments? , 1998, SIGIR '98.
[26] Julián Urbano,et al. Stochastic Simulation of Test Collections: Evaluation Scores , 2018, SIGIR.
[27] Ellen M. Voorhees,et al. Variations in relevance judgments and the measurement of retrieval effectiveness , 1998, SIGIR '98.
[28] Student,et al. The Probable Error of a Mean , 1908 .
[29] David A. Hull. Using statistical testing in the evaluation of retrieval experiments , 1993, SIGIR.
[30] Julián Urbano,et al. Test collection reliability: a study of bias and robustness to statistical assumptions via stochastic simulation , 2016, Information Retrieval Journal.
[31] W. John Wilbur,et al. Non-parametric significance tests of retrieval performance comparisons , 1994, J. Inf. Sci..
[32] Ben Carterette,et al. Multiple testing in statistical analysis of systems-based information retrieval experiments , 2012, TOIS.
[33] Francis Tuerlinckx,et al. Type S error rates for classical and Bayesian single and multiple comparison procedures , 2000, Comput. Stat..
[34] Nicola Ferro,et al. Are IR Evaluation Measures on an Interval Scale? , 2017, ICTIR.
[35] F. Wilcoxon. Individual Comparisons by Ranking Methods , 1945 .
[36] Ben Carterette,et al. But Is It Statistically Significant?: Statistical Significance in IR Research, 1995-2014 , 2017, SIGIR.
[37] Mónica Marrero,et al. A comparison of the optimality of statistical significance tests for information retrieval evaluation , 2013, SIGIR.
[38] Ellen M. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness , 2000, Inf. Process. Manag..
[39] Ellen M. Voorhees,et al. The effect of topic set size on retrieval experiment error , 2002, SIGIR '02.
[40] Ellen M. Voorhees,et al. Topic set size redux , 2009, SIGIR.
[41] H. Kaiser,et al. Directional statistical decisions , 1960, Psychological Review.
[42] David E. Losada,et al. Using score distributions to compare statistical significance tests for information retrieval evaluation , 2019, J. Assoc. Inf. Sci. Technol..
[43] W. J. Conover,et al. On Methods of Handling Ties in the Wilcoxon Signed-Rank Test , 1973 .
[44] Tetsuya Sakai,et al. Evaluating evaluation metrics based on the bootstrap , 2006, SIGIR.
[45] Stephen E. Robertson,et al. On per-topic variance in IR evaluation , 2012, SIGIR '12.
[46] Tetsuya Sakai,et al. Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015 , 2016, SIGIR.