Application of the Bayesian MMSE estimator for classification error to gene expression microarray data

MOTIVATION: With the development of high-throughput genomic and proteomic technologies, coupled with the inherent difficulty of obtaining large samples, biomedicine faces challenging small-sample classification problems, in particular for error estimation. Most popular error estimation methods are motivated by intuition rather than mathematical inference. A recently proposed error estimator based on Bayesian minimum mean-square error (MMSE) estimation places error estimation in an optimal filtering framework. In this work, we examine the application of this error estimator to gene expression microarray data, including the suitability of the Gaussian model with normal-inverse-Wishart priors and how to find prior probabilities.

RESULTS: We provide an implementation for non-linear classification, where closed-form solutions are not available. We propose a methodology for calibrating normal-inverse-Wishart priors based on discarded microarray data and examine the performance on synthetic high-dimensional data and a real dataset from a breast cancer study. The calibrated Bayesian error estimator has superior root-mean-square performance, especially with moderate to high expected true errors and small feature sizes.

AVAILABILITY: We have implemented in C code the Bayesian error estimator for Gaussian distributions and normal-inverse-Wishart priors, both for linear classifiers, with exact closed-form representations, and for arbitrary classifiers, where we use a Monte Carlo approximation. Our code for the Bayesian error estimator and a toolbox of related utilities are available at http://gsp.tamu.edu/Publications/supplementary/dalton11a. Several supporting simulations are also included.

CONTACT: ldalton@tamu.edu
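The Monte Carlo approximation mentioned above can be sketched as follows: draw class-conditional Gaussian parameters from their normal-inverse-Wishart posteriors, evaluate the fixed classifier's true error under each draw, and average. This is a minimal illustrative sketch in Python, not the paper's C implementation; all function and parameter names (`sample_posterior_gaussian`, `bayesian_mmse_error`, the posterior hyperparameter tuple `(m, kappa, nu, Psi)`) are assumptions for illustration, and the inner true-error computation is itself approximated by sampling points rather than evaluated in closed form.

```python
# Hedged sketch: Monte Carlo approximation of a Bayesian MMSE error
# estimate for an arbitrary fixed classifier, assuming a two-class
# Gaussian model with normal-inverse-Wishart (NIW) posteriors.
import numpy as np
from scipy.stats import invwishart

def sample_posterior_gaussian(rng, m, kappa, nu, Psi):
    """Draw (mu, Sigma) from an NIW posterior with hyperparameters
    m (mean), kappa (mean precision scale), nu (dof), Psi (scale matrix)."""
    Sigma = np.atleast_2d(invwishart.rvs(df=nu, scale=Psi, random_state=rng))
    mu = rng.multivariate_normal(m, Sigma / kappa)
    return mu, Sigma

def bayesian_mmse_error(classifier, post0, post1, c=0.5,
                        n_param=200, n_points=500, seed=0):
    """Average the classifier's true error over posterior parameter draws.
    c is the prior probability of class 0; post0/post1 are NIW tuples."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_param):
        mu0, S0 = sample_posterior_gaussian(rng, *post0)
        mu1, S1 = sample_posterior_gaussian(rng, *post1)
        # Estimate each class-conditional error rate by sampling points
        # from the drawn Gaussians and checking the classifier's labels.
        x0 = rng.multivariate_normal(mu0, S0, size=n_points)
        x1 = rng.multivariate_normal(mu1, S1, size=n_points)
        e0 = np.mean(classifier(x0) != 0)
        e1 = np.mean(classifier(x1) != 1)
        errors.append(c * e0 + (1.0 - c) * e1)
    return float(np.mean(errors))
```

For example, with a one-dimensional threshold classifier `lambda X: (X[:, 0] > 0).astype(int)` and well-separated posterior means, the returned estimate is a small expected error averaged over the posterior uncertainty in the class means and covariances.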
