The Illusion of Distribution-Free Small-Sample Classification in Genomics

Classification has emerged as a major area of investigation in bioinformatics owing to the desire to discriminate phenotypes, in particular disease conditions, using high-throughput genomic data. While many classification rules have been proposed, there is a paucity of error estimation rules and an even greater paucity of theory concerning error estimation accuracy. This is problematic because the worth of a classifier depends mainly on its error rate. It is commonplace in bioinformatics papers for a classification rule to be applied to a small labeled data set and for the error of the resulting classifier to be estimated on the same data set, most often via cross-validation, without any assumptions being made about the underlying feature-label distribution. Concomitant with the lack of distributional assumptions is the absence of any statement regarding the accuracy of the error estimate. Without such a measure of accuracy, the most common being the root-mean-square (RMS) error, the error estimate is essentially meaningless and the worth of the entire paper is questionable. In small-sample settings, the absence of distributional assumptions guarantees the absence of a measure of error estimation accuracy, because even when distribution-free bounds exist (and that is rare), the sample sizes required under the bounds are so large as to make them useless for small samples. Thus, distributional bounds are necessary, and the distributional assumptions need to be stated. Owing to the epistemological dependence of classifiers on the accuracy of their estimated errors, scientifically meaningful distribution-free classification in high-throughput, small-sample biology is an illusion.
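To make the role of the RMS concrete, the following is a minimal simulation sketch, not a method taken from the paper. It assumes a specific feature-label distribution (two equally likely Gaussian classes with identity covariance), a nearest-mean classification rule, and stratified sampling with a fixed number of points per class; all of these choices, along with the function names and parameter values, are illustrative assumptions. Because the distribution is assumed known, the true error of each designed classifier can be computed exactly and compared with its leave-one-out cross-validation estimate, which is what allows an RMS to be stated at all.

```python
import math
import numpy as np


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fit_nearest_mean(X, y):
    """Nearest-mean rule: predict class 1 when w @ x + b > 0."""
    m0 = X[y == 0].mean(axis=0)
    m1 = X[y == 1].mean(axis=0)
    w = m1 - m0
    b = -0.5 * float(w @ (m0 + m1))
    return w, b


def true_error(w, b, mu0, mu1):
    """Exact error of the linear classifier (w, b) under the ASSUMED model:
    classes N(mu0, I) and N(mu1, I) with equal prior probabilities."""
    norm = np.linalg.norm(w)
    e0 = normal_cdf((w @ mu0 + b) / norm)   # class 0 point labeled as 1
    e1 = normal_cdf(-(w @ mu1 + b) / norm)  # class 1 point labeled as 0
    return 0.5 * (e0 + e1)


def loo_error(X, y):
    """Leave-one-out cross-validation estimate of the error."""
    n = len(y)
    mistakes = 0
    for i in range(n):
        keep = np.arange(n) != i
        w, b = fit_nearest_mean(X[keep], y[keep])
        pred = int(X[i] @ w + b > 0)
        mistakes += int(pred != y[i])
    return mistakes / n


def rms_of_loo(n_per_class, dim=5, delta=1.0, reps=300, seed=0):
    """Monte Carlo RMS of (estimated error - true error) under the assumed model.
    dim, delta, reps and seed are illustrative values, not from the paper."""
    rng = np.random.default_rng(seed)
    mu0 = np.zeros(dim)
    mu1 = np.zeros(dim)
    mu1[0] = delta  # class means differ only along the first feature
    y = np.repeat([0, 1], n_per_class)
    deviations = np.empty(reps)
    for r in range(reps):
        X = np.vstack([
            rng.normal(mu0, 1.0, (n_per_class, dim)),
            rng.normal(mu1, 1.0, (n_per_class, dim)),
        ])
        w, b = fit_nearest_mean(X, y)
        deviations[r] = loo_error(X, y) - true_error(w, b, mu0, mu1)
    return math.sqrt(np.mean(deviations ** 2))


if __name__ == "__main__":
    for n in (10, 50):
        print(f"n per class = {n}: RMS of leave-one-out ~ {rms_of_loo(n, reps=200):.3f}")
```

The sketch only works because `true_error` uses the assumed Gaussian model. With real data and no stated distribution, the deviation between the estimated and true errors cannot be computed or bounded in this way, which is the situation the abstract criticizes.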
