High-dimensional bolstered error estimation

MOTIVATION: In small-sample settings, bolstered error estimation has been shown to outperform cross-validation and to be competitive with the bootstrap under various criteria. The key issue for bolstering performance is the variance setting of the bolstering kernel. Heretofore, this variance has been determined non-parametrically from the data. Although bolstering based on this variance setting works well for small feature sets, results can deteriorate in high-dimensional feature spaces.

RESULTS: This article computes an optimal kernel variance as a function of the classification rule, sample size, model and feature-space dimension, considering both the original number of features and the number remaining after feature selection. A key point is that the optimal variance is robust with respect to the model. This allows us to develop a method for selecting a suitable variance in real-world applications where the model is unknown but the other factors determining the optimal kernel are known.

AVAILABILITY: Companion website at http://compbio.tgen.org/paper_supp/high_dim_bolstering

CONTACT: edward@mail.ece.tamu.edu
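To make the role of the kernel variance concrete, here is a minimal Monte Carlo sketch of bolstered resubstitution: each training point is replaced by a spherical Gaussian "bolstering kernel" of standard deviation sigma, and the error estimate is the average kernel mass landing on the wrong side of the decision boundary. The nearest-centroid classifier, the toy data and all names are illustrative assumptions, not the article's actual experimental setup.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    # Class centroids for a simple two-class nearest-centroid classifier.
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def nearest_centroid_predict(centroids, X):
    # Assign each row of X to the closest centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def bolstered_resubstitution(X, y, sigma, n_mc=500, seed=0):
    """Monte Carlo bolstered resubstitution error estimate.

    A zero-mean spherical Gaussian with standard deviation `sigma`
    (the bolstering kernel) is centered at each training point; the
    estimate is the average fraction of kernel samples that the
    trained classifier assigns to the wrong class.  sigma = 0 recovers
    plain resubstitution.
    """
    rng = np.random.default_rng(seed)
    centroids = nearest_centroid_fit(X, y)
    errors = []
    for xi, yi in zip(X, y):
        samples = xi + sigma * rng.standard_normal((n_mc, X.shape[1]))
        preds = nearest_centroid_predict(centroids, samples)
        errors.append(np.mean(preds != yi))
    return float(np.mean(errors))

# Toy two-class Gaussian data in 2-D (hypothetical example).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
est = bolstered_resubstitution(X, y, sigma=1.0)
```

The variance setting is exactly the knob the article studies: too small a sigma and the estimate collapses toward optimistically biased resubstitution; too large and kernel mass spills far past the decision boundary, which is why a variance tuned to dimension, sample size and classification rule matters in high-dimensional spaces.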
