Finding Important Genes from High-Dimensional Data: An Appraisal of Statistical Tests and Machine-Learning Approaches

Over the past decades, statisticians and machine-learning researchers have developed thousands of tools for reducing high-dimensional data in order to identify the variables most responsible for a particular trait. These tools are applied in many settings, including business, education, forensics, and biology (for example, microarray, proteomics, and brain-imaging data). In the present work, we investigate the limitations and potential misuses of certain tools in the analysis of the benchmark colon cancer data (2,000 variables; Alon et al., 1999) and the prostate cancer data (6,033 variables; Efron, 2008, 2010). Our analysis demonstrates that models reporting 100% accuracy often select different sets of genes and cannot withstand scrutiny of their parameter estimates and model stability. Furthermore, we created a host of simulated datasets and "artificial diseases" to evaluate the reliability of commonly used statistical and data-mining tools. We found that certain widely used models can classify the data with 100% accuracy without using any of the variables responsible for the disease. With a moderate sample size and suitable pre-screening, stochastic gradient boosting is shown to be a superior model for gene selection and variable screening in high-dimensional datasets.
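
The claim that a model can reach 100% accuracy without using any disease-related variable can be illustrated with a small simulation. The sketch below is not the authors' procedure: it assumes an "artificial disease" driven by a single hidden variable, 2,000 independent noise variables, and an illustrative sample size of 62, and it uses a plain perceptron as a stand-in for the "widely used models" the abstract refers to. The names `simulate`, `train_perceptron`, and `accuracy` are hypothetical helpers introduced only for this example.

```python
import random

def simulate(n_samples, n_noise, rng):
    """Return (noise_matrix, labels). Labels depend only on a hidden causal
    variable that is NOT included among the returned noise variables."""
    X, y = [], []
    for _ in range(n_samples):
        causal = rng.gauss(0.0, 1.0)              # sole driver of the "disease"
        y.append(1 if causal > 0 else -1)
        X.append([rng.gauss(0.0, 1.0) for _ in range(n_noise)])  # pure noise
    return X, y

def train_perceptron(X, y, max_epochs=200):
    """Classic perceptron: update on every misclassified sample and stop
    once a full pass over the data produces no mistakes."""
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi))
            if yi * score <= 0:
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

def accuracy(w, X, y):
    """Fraction of samples whose predicted sign matches the label."""
    hits = 0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi))
        hits += (1 if score > 0 else -1) == yi
    return hits / len(y)

rng = random.Random(0)                            # fixed seed for repeatability
X_train, y_train = simulate(62, 2000, rng)        # assumed sizes, for illustration
X_test, y_test = simulate(62, 2000, rng)          # fresh samples, same process

w = train_perceptron(X_train, y_train)
train_acc = accuracy(w, X_train, y_train)
test_acc = accuracy(w, X_test, y_test)
print(f"training accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

Because the number of variables far exceeds the number of samples, the noise variables alone separate the training labels perfectly, whereas accuracy on fresh samples should be expected to hover around chance. This is one reason a reported 100% accuracy says little about whether the selected variables are related to the disease.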

[1]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[2]  John D. Storey False Discovery Rates , 2010 .

[3]  Liu Yang,et al.  Quantitative Epistasis Analysis and Pathway Inference from Genetic Interaction Data , 2011, PLoS Comput. Biol..

[4]  Jian Huang,et al.  A Selective Review of Group Selection in High-Dimensional Models. , 2012, Statistical science : a review journal of the Institute of Mathematical Statistics.

[5]  Jay Magidson,et al.  Correlated Component Regression: A Prediction/Classification Methodology for Possibly Many Features , 2010 .

[6]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[7]  H. Zou,et al.  Regularization and variable selection via the elastic net , 2005 .

[8]  Donald W. Bowden,et al.  Genome-Wide Association Study of Coronary Heart Disease and Its Risk Factors in 8,090 African Americans: The NHLBI CARe Project , 2011, PLoS genetics.

[9]  Hua Liang,et al.  ESTIMATION AND VARIABLE SELECTION FOR GENERALIZED ADDITIVE PARTIAL LINEAR MODELS. , 2011, Annals of statistics.

[10]  R. W. Doerge,et al.  Calculation of the minimum number of replicate spots required for detection of significant gene expression fold change in microarray experiments , 2002, Bioinform..

[11]  Jeffrey T. Leek,et al.  The Joint Null Criterion for Multiple Hypothesis Tests , 2011, Statistical Applications in Genetics and Molecular Biology.

[12]  Yudong D. He,et al.  A Novel Statistical Algorithm for Gene Expression Analysis Helps Differentiate Pregnane X Receptor-Dependent and Independent Mechanisms of Toxicity , 2010, PloS one.

[13]  S. Wold,et al.  PLS-regression: a basic tool of chemometrics , 2001 .

[14]  Jian Huang,et al.  Supervised group Lasso with applications to microarray data , 2007, BMC Bioinformatics.

[15]  M. Schemper,et al.  A solution to the problem of separation in logistic regression , 2002, Statistics in medicine.

[16]  John D. Storey,et al.  False Discovery Rate , 2020, International Encyclopedia of Statistical Science.

[17]  Mickael Guedj,et al.  Should We Abandon the t-Test in the Analysis of Gene Expression Microarray Data: A Comparison of Variance Modeling Strategies , 2010, PloS one.

[18]  Y. Benjamini,et al.  Adaptive linear step-up procedures that control the false discovery rate , 2006 .

[19]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[20]  Robert Tibshirani,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2001, Springer Series in Statistics.

[21]  Nema Dean,et al.  Latent class analysis variable selection , 2010, Annals of the Institute of Statistical Mathematics.

[22]  Qinghua Hu,et al.  An efficient gene selection technique for cancer recognition based on neighborhood mutual information , 2010, Int. J. Mach. Learn. Cybern..

[23]  Stephen M. Stigler,et al.  The Changing History of Robustness , 2010 .

[24]  Subhabrata Chakrabarti,et al.  Complex genetic mechanisms in glaucoma: An overview , 2011, Indian journal of ophthalmology.

[25]  Dean P. Foster,et al.  Variable Selection in Data Mining , 2004 .

[26]  Prasad A. Naik,et al.  A New Dimension Reduction Approach for Data-Rich Marketing Environments: Sliced Inverse Regression , 2000 .

[27]  Martin T. Wells,et al.  Laplace Approximated EM Microarray Analysis: An Empirical Bayes Approach for Comparative Microarray Experiments , 2010, 1101.0905.

[28]  J. A. Ferreira,et al.  On the Benjamini-Hochberg method , 2006, math/0611265.

[29]  Houston H. Stokes,et al.  On the advantage of using two or more econometric software systems to solve the same problem , 2004 .

[30]  S. Dudoit,et al.  Multiple Hypothesis Testing in Microarray Experiments , 2003 .

[31]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[32]  Alejandro Sierra,et al.  Skipping Fisher's Criterion , 2003, IbPRIA.

[33]  M. Yuan,et al.  On the non‐negative garrotte estimator , 2007 .

[34]  Mee Young Park,et al.  Penalized logistic regression for detecting gene interactions. , 2008, Biostatistics.

[35]  H. Cordell Detecting gene–gene interactions that underlie human diseases , 2009, Nature Reviews Genetics.

[36]  H. Cordell Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. , 2002, Human molecular genetics.

[37]  R. Gregory Significance , 2003, Perception.

[38]  Vladimir Pavlovic,et al.  RankGene: identification of diagnostic genes based on expression data , 2003, Bioinform..

[39]  C. Glymour,et al.  STATISTICS AND CAUSAL INFERENCE , 1985 .

[40]  Richard Simon,et al.  Microarray-based cancer prediction using single genes , 2011, BMC Bioinformatics.

[41]  Bradley Efron,et al.  The Future of Indirect Evidence. , 2010, Statistical science : a review journal of the Institute of Mathematical Statistics.

[42]  E. Lander,et al.  Gene expression correlates of clinical prostate cancer behavior. , 2002, Cancer cell.

[43]  Shuiwang Ji,et al.  SLEP: Sparse Learning with Efficient Projections , 2011 .

[44]  Jian Huang,et al.  The Sparse Laplacian Shrinkage Estimator for High-Dimensional Regression. , 2011, Annals of statistics.

[45]  Yuh-Jye Lee,et al.  Incremental Forward Feature Selection with Application to Microarray Gene Expression Data , 2008, Journal of biopharmaceutical statistics.

[46]  David J Hand,et al.  Breast Cancer Diagnosis from Proteomic Mass Spectrometry Data: A Comparative Evaluation , 2008, Statistical applications in genetics and molecular biology.

[47]  Bradley Efron,et al.  False discovery rates and copy number variation , 2011 .

[48]  J. Friedman Greedy function approximation: A gradient boosting machine. , 2001 .

[49]  Yoav Benjamini,et al.  Microarrays, Empirical Bayes and the Two-Groups Model. Comment. , 2008 .

[50]  Basilio de Braganca Pereira,et al.  Data Mining Using Neural Networks: A Guide for Statisticians , 2009 .

[51]  Galit Shmueli,et al.  To Explain or To Predict? , 2010, 1101.0891.

[52]  Kristel Van Steen,et al.  Travelling the world of gene-gene interactions , 2012, Briefings Bioinform..

[53]  D. Firth Bias reduction of maximum likelihood estimates , 1993 .

[54]  K. Strimmer,et al.  High-Dimensional Regression and Variable Selection Using CAR Scores , 2011, Statistical Applications in Genetics and Molecular Biology.

[55]  C. Croce,et al.  Muir-Torre-like syndrome in Fhit-deficient mice. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[56]  D. Freedman Randomization Does Not Justify Logistic Regression , 2008, 0808.3914.

[57]  J. Pearl Statistics and causal inference: A review , 2003 .

[58]  J. H. Moore,et al.  Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. , 2001, American journal of human genetics.

[59]  Jun Li,et al.  Susceptibility locus for clinical and subclinical coronary artery disease at chromosome 9p21 in the multi-ethnic ADVANCE study. , 2008, Human molecular genetics.