Testing the tests: simulation of rankings to compare statistical significance tests in information retrieval evaluation
[1] Marco Marozzi, et al. Nonparametric Simultaneous Tests for Location and Scale Testing: A Comparison of Several Methods, 2013, Commun. Stat. Simul. Comput.
[2] Stephen E. Robertson, et al. Modelling Score Distributions Without Actual Scores, 2013, ICTIR.
[3] Mark Sanderson, et al. Information retrieval system evaluation: effort, sensitivity, and reliability, 2005, SIGIR '05.
[4] Gordon V. Cormack, et al. Validity and power of t-test for comparing MAP and GMAP, 2007, SIGIR.
[5] Tetsuya Sakai, et al. Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015, 2016, SIGIR.
[6] C. Borror. Practical Nonparametric Statistics, 3rd Ed., 2001.
[7] Robustness Against Inequality of Variances, 1982.
[8] Ben Carterette, et al. Statistical Significance Testing in Information Retrieval: Theory and Practice, 2014, SIGIR.
[9] James Allan, et al. A comparison of statistical significance tests for information retrieval evaluation, 2007, CIKM '07.
[10] David E. Losada, et al. Using score distributions to compare statistical significance tests for information retrieval evaluation, 2019, J. Assoc. Inf. Sci. Technol.
[11] J. J. Higgins, et al. Comparison of the power of the paired samples t test to that of Wilcoxon's signed-ranks test under various population shapes, 1985.
[12] Julián Urbano, et al. Stochastic Simulation of Test Collections: Evaluation Scores, 2018, SIGIR.
[13] H. Levene. Robust tests for equality of variances, 1961.
[14] Tetsuya Sakai. Two Sample T-tests for IR Evaluation: Student or Welch?, 2016, SIGIR.
[15] Ellen M. Voorhees, et al. Evaluation by highly relevant documents, 2001, SIGIR '01.
[16] Alan Hanjalic, et al. Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors, 2019, SIGIR.
[17] H. Keselman, et al. Multiple Comparison Procedures, 2005.
[18] Justin Zobel, et al. How reliable are the results of large-scale information retrieval experiments?, 1998, SIGIR '98.
[19] Y. B. Wah, et al. Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests, 2011.
[20] Yifan Huang, et al. To permute or not to permute, 2006, Bioinform.
[21] Robert J. Boik, et al. The Fisher-Pitman permutation test: A non-robust alternative to the normal theory F test when variances are heterogeneous, 1987.
[22] Evangelos Kanoulas, et al. Score distribution models: assumptions, intuition, and robustness to score manipulation, 2010, SIGIR.
[23] Tetsuya Sakai, et al. Statistical reform in information retrieval?, 2014, SIGF.
[24] R. Blair, et al. A more realistic look at the robustness and Type II error properties of the t test to departures from population normality, 1992.