Classifier variability: Accounting for training and testing

We categorize the statistical assessment of classifiers into three levels: (i) assessing the classification performance and its testing variability conditional on a fixed training set; (ii) assessing the performance and its variability accounting for both training and testing; and (iii) assessing the performance averaged over training sets and its variability accounting for both training and testing. We derive analytical expressions for the variance of the estimated AUC and provide freely available software implementing an efficient computation algorithm. Our approach can be applied to assess any classifier with ordinal (continuous or discrete) outputs. Applications to simulated and real datasets illustrate the methods.
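To make the three levels concrete, the sketch below (our illustration, not the paper's software or its analytical U-statistic expressions) computes the nonparametric AUC, its DeLong test-set variance conditional on a fixed training set (level one), and a bootstrap-over-training-sets plug-in of the law of total variance to approximate the variability that accounts for both training and testing (levels two and three). The Fisher-discriminant trainer, all function names, and the toy data are assumptions made for the example.

```python
import numpy as np

def empirical_auc(pos_scores, neg_scores):
    """Nonparametric (Mann-Whitney) estimate of AUC = P(positive score > negative score)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]   # shape (m, 1)
    neg = np.asarray(neg_scores, dtype=float)[None, :]   # shape (1, n)
    psi = (pos > neg) + 0.5 * (pos == neg)               # success kernel with tie correction
    return psi.mean()

def delong_variance(pos_scores, neg_scores):
    """Level one: testing variability of the empirical AUC conditional on a
    fixed training set, via DeLong's structural components."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    psi = (pos > neg) + 0.5 * (pos == neg)
    m, n = psi.shape
    v10 = psi.mean(axis=1)   # component for each positive case
    v01 = psi.mean(axis=0)   # component for each negative case
    return v10.var(ddof=1) / m + v01.var(ddof=1) / n

def train_fisher(X, y):
    """Illustrative classifier: two-class Fisher linear discriminant (ordinal output)."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    cov = np.cov(X.T) + 1e-6 * np.eye(X.shape[1])        # small ridge for stability
    w = np.linalg.solve(cov, mu1 - mu0)
    return lambda Z: Z @ w                               # decision scores

def auc_train_and_test_variability(X_tr, y_tr, X_te, y_te, B=200, seed=0):
    """Levels two and three, approximated by resampling: retrain on B stratified
    bootstrap copies of the training set, score the fixed test set, and combine
    the between-training spread with the average conditional test variance
    (a crude plug-in of the law of total variance)."""
    rng = np.random.default_rng(seed)
    i0, i1 = np.where(y_tr == 0)[0], np.where(y_tr == 1)[0]
    aucs, test_vars = [], []
    for _ in range(B):
        idx = np.concatenate([rng.choice(i0, size=i0.size),   # stratified bootstrap
                              rng.choice(i1, size=i1.size)])  # keeps both classes present
        score = train_fisher(X_tr[idx], y_tr[idx])
        s = score(X_te)
        aucs.append(empirical_auc(s[y_te == 1], s[y_te == 0]))
        test_vars.append(delong_variance(s[y_te == 1], s[y_te == 0]))
    aucs = np.asarray(aucs)
    mean_auc = aucs.mean()   # performance averaged over training sets
    total_var = aucs.var(ddof=1) + np.mean(test_vars)
    return mean_auc, total_var

# Toy usage: two Gaussian classes in 5 dimensions.
rng = np.random.default_rng(1)
d, n_tr, n_te = 5, 100, 200
X_tr = np.vstack([rng.normal(0.0, 1.0, (n_tr, d)), rng.normal(0.5, 1.0, (n_tr, d))])
y_tr = np.repeat([0, 1], n_tr)
X_te = np.vstack([rng.normal(0.0, 1.0, (n_te, d)), rng.normal(0.5, 1.0, (n_te, d))])
y_te = np.repeat([0, 1], n_te)
print(auc_train_and_test_variability(X_tr, y_tr, X_te, y_te))
```

Note that the resampling here only sketches the variance decomposition; the paper itself derives closed-form expressions for these components rather than relying on the bootstrap.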
