Testing Cross-Validation Variants in Ranking Environments

This research investigates how to determine whether two rankings could come from the same distribution. We evaluate three hybrid tests, namely Wilcoxon's, Dietterich's, and Alpaydin's statistical tests combined with cross-validation (CV), each run with 5 to 10 folds, giving 18 variants in total. We used the framework of a popular comparative statistical test, the Sum of Ranking Differences, but our results are representative of all ranking environments. To compare these methods, we followed an innovative approach borrowed from economics. We designed eight scenarios for testing type I and type II errors. These represent typical situations (i.e., different data structures) that CV tests routinely face. The optimal CV method depends on the preferences regarding the minimization of type I and type II errors, the size of the input, and the expected patterns in the data. The Wilcoxon method with eight folds proved to be the best under all three investigated input sizes, although there were scenarios and decision aspects where other methods, namely Wilcoxon with 10 folds and Alpaydin with 10 folds, performed better.
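As a rough illustration of what one such variant might look like, the sketch below combines a Sum of Ranking Differences (SRD) score with k-fold resampling and a Wilcoxon signed-rank test. It is a minimal sketch under stated assumptions, not the procedure used in the study: the abstract does not specify how folds are formed, so the leave-one-fold-out scheme over contiguous blocks, the function names (`average_ranks`, `srd`, `fold_srds`, `wilcoxon_signed_rank`), and the exact enumeration used for the p-value are all illustrative choices.

```python
from itertools import product


def average_ranks(values):
    """Rank values ascending (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def srd(scores, reference):
    """Sum of absolute rank differences between one method and the reference."""
    pairs = zip(average_ranks(scores), average_ranks(reference))
    return sum(abs(a - b) for a, b in pairs)


def fold_srds(scores, reference, k):
    """Assumed CV scheme: recompute SRD k times, each time leaving out
    one contiguous fold of objects."""
    n = len(scores)
    results = []
    for f in range(k):
        lo, hi = f * n // k, (f + 1) * n // k
        keep = [i for i in range(n) if not lo <= i < hi]
        results.append(srd([scores[i] for i in keep],
                           [reference[i] for i in keep]))
    return results


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test; zero differences are dropped.
    Returns (min(W+, W-), p-value). Enumerates all sign patterns, which is
    feasible here because there are at most 10 folds."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed + 1e-12:
            hits += 1
    return observed, hits / 2 ** len(ranks)
```

With two methods scored on the same objects, `fold_srds(method_a, reference, 8)` and `fold_srds(method_b, reference, 8)` would each yield eight SRD values, and `wilcoxon_signed_rank` applied to the two lists would test whether the methods' distances from the reference differ systematically. The reference ranking itself (for example, a row-wise average) is left to the caller.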
