Is Bagging Effective in the Classification of Small-Sample Genomic and Proteomic Data?

There has been considerable interest recently in the application of bagging in the classification of both gene-expression data and protein-abundance mass spectrometry data. The approach is often justified by the improvement it produces on the performance of unstable, overfitting classification rules under small-sample situations. However, the question of real practical interest is whether the ensemble scheme will improve performance of those classifiers sufficiently to beat the performance of single stable, nonoverfitting classifiers, in the case of small-sample genomic and proteomic data sets. To investigate that question, we conducted a detailed empirical study, using publicly-available data sets from published genomic and proteomic studies. We observed that, under t-test and RELIEF filter-based feature selection, bagging generally does a good job of improving the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that improvement was not sufficient to beat the performance of single stable, nonoverfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors. Furthermore, as expected, the ensemble method did not improve the performance of these classifiers significantly. Representative experimental results are presented and discussed in this work.

[1]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[2]  B. Efron Bootstrap Methods: Another Look at the Jackknife , 1979 .

[3]  S. Dudoit,et al.  Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data , 2002 .

[4]  Kathleen Marchal,et al.  Evaluation of time profile reconstruction from complex two-color microarray designs , 2008, BMC Bioinformatics.

[5]  M.H. Moradi,et al.  A Novel Ensemble Strategy for Classification of Prostate Cancer Protein Mass Spectra , 2007, 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society.

[6]  J. William Ahwood,et al.  CLASSIFICATION , 1931, Foundations of Familiar Language.

[7]  Edward R. Dougherty,et al.  Performance of feature-selection methods in the classification of high-dimension data , 2009, Pattern Recognit..

[8]  Larry A. Rendell,et al.  The Feature Selection Problem: Traditional Methods and a New Algorithm , 1992, AAAI.

[9]  Stephen E. Fienberg,et al.  Testing Statistical Hypotheses , 2005 .

[10]  Eric Bauer,et al.  An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants , 1999, Machine Learning.

[11]  中澤 真,et al.  Devroye, L., Gyorfi, L. and Lugosi, G. : A Probabilistic Theory of Pattern Recognition, Springer (1996). , 1997 .

[12]  László Györfi,et al.  A Probabilistic Theory of Pattern Recognition , 1996, Stochastic Modelling and Applied Probability.

[13]  David Hinkley,et al.  Bootstrap Methods: Another Look at the Jackknife , 2008 .

[14]  Yudong D. He,et al.  Gene expression profiling predicts clinical outcome of breast cancer , 2002, Nature.

[15]  Michael R. Chernick,et al.  Bootstrap Methods: A Practitioner's Guide , 1999 .

[16]  G. Izmirlian,et al.  Application of the Random Forest Classification Algorithm to a SELDI‐TOF Proteomics Study in the Setting of a Cancer Prevention Trial , 2004, Annals of the New York Academy of Sciences.

[17]  Oleksandr Makeyev,et al.  Neural network with ensembles , 2010, The 2010 International Joint Conference on Neural Networks (IJCNN).

[18]  David Ward,et al.  Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data , 2003, Bioinform..

[19]  S. Y. Sohn,et al.  Experimental study for the comparison of classifier combination methods , 2007, Pattern Recognit..

[20]  Adam Krzyżak,et al.  Methods of combining multiple classifiers and their applications to handwriting recognition , 1992, IEEE Trans. Syst. Man Cybern..

[21]  Ching Y. Suen,et al.  Application of majority voting to pattern recognition: an analysis of its behavior and performance , 1997, IEEE Trans. Syst. Man Cybern. Part A.

[22]  Yanchun Zhang,et al.  Bagging Support Vector Machine for Classification of SELDI-ToF Mass Spectra of Ovarian Cancer Serum Samples , 2007, Australian Conference on Artificial Intelligence.

[23]  B. Efron The jackknife, the bootstrap, and other resampling plans , 1987 .

[24]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[25]  Ana Osorio,et al.  A predictor based on the somatic genomic changes of the BRCA1/BRCA2 breast cancer tumors identifies the non-BRCA1/BRCA2 tumors with BRCA1 promoter hypermethylation. , 2005, Clinical cancer research : an official journal of the American Association for Cancer Research.

[26]  Ravinder Singh,et al.  Fast-Find: A novel computational approach to analyzing combinatorial motifs , 2006, BMC Bioinformatics.

[27]  D. Stone,et al.  Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[28]  Pierre Geurts,et al.  Proteomic mass spectra classification using decision tree based ensemble methods , 2005, Bioinform..

[29]  B. Efron Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation , 1983 .

[30]  Weida Tong,et al.  Using Decision Forest to Classify Prostate Cancer Samples on the Basis of SELDI-TOF MS Data: Assessing Chance Correlation and Prediction Confidence , 2004, Environmental health perspectives.

[31]  Van,et al.  A gene-expression signature as a predictor of survival in breast cancer. , 2002, The New England journal of medicine.

[32]  Constantin F. Aliferis,et al.  A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification , 2008, BMC Bioinformatics.

[33]  Yoav Freund,et al.  Boosting a weak learning algorithm by majority , 1990, COLT '90.

[34]  Ramón Díaz-Uriarte,et al.  Gene selection and classification of microarray data using random forest , 2006, BMC Bioinformatics.

[35]  Thomas P Conrads,et al.  The SELDI-TOF MS approach to proteomics: protein profiling and biomarker identification. , 2002, Biochemical and biophysical research communications.

[36]  Bruschi,et al.  Classification of , 2010 .

[37]  P. Schellhammer,et al.  Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. , 2002, Cancer research.

[38]  R. Schapire The Strength of Weak Learnability , 1990, Machine Learning.

[39]  G DietterichThomas An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees , 2000 .

[40]  Thomas G. Dietterich An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization , 2000, Machine Learning.