Multi-Factorial Analysis of Class Prediction Error: Estimating Optimal Number of Biomarkers for Various Classification Rules

Classifiers based on machine learning and statistical models are increasingly applied to complex, high-dimensional biological data obtained from high-throughput technologies. Understanding how the various factors associated with large and complex microarray datasets affect the predictive performance of classifiers is computationally intensive and under-investigated, yet vital for determining the optimal number of biomarkers for classification tasks aimed at improved detection, diagnosis, and therapeutic monitoring of diseases. We use simulation studies to investigate the impact of microarray-based data characteristics on the predictive performance of various classification rules. Our investigation using Random Forest, Support Vector Machines, Linear Discriminant Analysis and k-Nearest Neighbour shows that the predictive performance of classifiers is strongly influenced by training set size, biological and technical variability, replication, fold change and correlation between biomarkers. The optimal number of biomarkers for a classification problem should therefore be estimated taking into account the impact of all these factors. We build a database of average generalization errors for various combinations of these factors, which can be used to estimate the optimal number of biomarkers for a given level of predictive accuracy as a function of these factors. Examples show that curves from actual biological data resemble those from simulated data with corresponding levels of data characteristics. An R package, optBiomarker, implementing the method is freely available for academic use from the Comprehensive R Archive Network (http://www.cran.r-project.org/web/packages/optBiomarker/).
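The idea of tabulating simulated generalization errors and reading off the smallest adequate number of biomarkers can be illustrated with a minimal Python sketch. It is not the optBiomarker implementation, and every name in it (`simulate_error`, `build_error_table`, `optimal_number`) is hypothetical. It assumes independent, equally informative biomarkers (so the correlation factor is ignored), a per-marker variance of biological variance plus technical variance divided by the number of replicates, and a nearest-centroid rule as a simple stand-in for the classifiers named above.

```python
import math
import random


def simulate_error(n_markers, n_train, fold_change, sd_bio, sd_tech,
                   n_reps, n_test=200, n_sims=20, seed=0):
    """Average test error of a nearest-centroid rule on simulated data.

    Assumptions (illustrative only): two balanced classes, independent
    Gaussian markers, class means at +/- fold_change / 2 on the log scale,
    and variance sd_bio^2 + sd_tech^2 / n_reps after averaging replicates.
    """
    rng = random.Random(seed)
    sd = math.sqrt(sd_bio ** 2 + sd_tech ** 2 / n_reps)
    half = fold_change / 2.0

    def draw(label):
        # One sample: a vector of n_markers expression values.
        mu = half if label == 1 else -half
        return [rng.gauss(mu, sd) for _ in range(n_markers)]

    def centroid(rows):
        # Column-wise mean of the training samples of one class.
        return [sum(col) / len(col) for col in zip(*rows)]

    errors = []
    for _ in range(n_sims):
        # Train: estimate one centroid per class from n_train // 2 samples.
        centroids = {
            lab: centroid([draw(lab) for _ in range(n_train // 2)])
            for lab in (0, 1)
        }
        # Test: assign each new sample to the closest centroid.
        wrong = 0
        for i in range(n_test):
            lab = i % 2
            x = draw(lab)
            dist = {
                c: sum((a - b) ** 2 for a, b in zip(x, m))
                for c, m in centroids.items()
            }
            if min(dist, key=dist.get) != lab:
                wrong += 1
        errors.append(wrong / n_test)
    return sum(errors) / len(errors)


def build_error_table(marker_grid, **factors):
    """Map each candidate number of biomarkers to its average error."""
    return {m: simulate_error(m, **factors) for m in marker_grid}


def optimal_number(table, target_error):
    """Smallest number of biomarkers whose error meets the target, or None."""
    for m in sorted(table):
        if table[m] <= target_error:
            return m
    return None


if __name__ == "__main__":
    table = build_error_table(
        [1, 2, 5, 10, 20],
        n_train=20, fold_change=1.0, sd_bio=1.0, sd_tech=0.5, n_reps=1,
    )
    for m, err in table.items():
        print(m, round(err, 3))
    print("optimal:", optimal_number(table, target_error=0.10))
```

Repeating `build_error_table` over a grid of training set sizes, variance components, replicate counts and fold changes would give a small analogue of the error database described above; handling correlated biomarkers would require drawing from a multivariate distribution, which this sketch leaves out.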
