Regression Approaches for Microarray Data Analysis

A variety of new procedures have been devised to handle the two-sample comparison (e.g., tumor versus normal tissue) of gene expression values as measured with microarrays. Such methods are required in part because of two defining characteristics of microarray-based studies: (i) the number of genes contributing expression measures far exceeds the number of samples (observations) available, and (ii) by virtue of pathway/network relationships, the gene expression measures tend to be highly correlated. These concerns are exacerbated in the regression setting, where the objective is to relate gene expression, simultaneously for multiple genes, to some external outcome or phenotype. Correspondingly, several methods have recently been proposed to address these issues. We briefly critique some of these methods prior to a detailed evaluation of gene harvesting. This evaluation reveals that gene harvesting, without additional constraints, can yield artifactual solutions. Results obtained employing such constraints motivate the use of regularized regression procedures such as the lasso, least angle regression, and support vector machines. Model selection and solution multiplicity issues are also discussed. The methods are evaluated using a microarray-based study of cardiomyopathy in transgenic mice.
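
To make the "many more genes than samples" setting concrete, the following is a minimal sketch of lasso regression fitted by cyclic coordinate descent on simulated data. It is an illustration only, not the procedure evaluated in the paper: the toy data (20 samples, 100 simulated "genes", two of which drive the outcome), the penalty level, and all function names (`lasso_cd`, `lambda_max`, `make_toy_data`) are assumptions introduced here.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_objective(X, y, beta, lam):
    """(1/2n) * ||y - X beta||^2 + lam * ||beta||_1."""
    n = len(X)
    rss = sum((y[i] - sum(x * b for x, b in zip(X[i], beta))) ** 2
              for i in range(n))
    return rss / (2 * n) + lam * sum(abs(b) for b in beta)


def lasso_cd(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for the lasso objective above (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # current residual y - X beta
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (column j removed).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


def lambda_max(X, y):
    """Smallest penalty at which every coefficient is zero."""
    n, p = len(X), len(X[0])
    return max(abs(sum(X[i][j] * y[i] for i in range(n)) / n) for j in range(p))


def make_toy_data(n=20, p=100, seed=0):
    """Hypothetical data: p simulated genes, only the first two affect y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0.0, 0.1) for row in X]
    return X, y


X, y = make_toy_data()
lam = 0.1 * lambda_max(X, y)  # assumed penalty level, chosen for illustration
beta = lasso_cd(X, y, lam)
print("nonzero coefficients:", sum(b != 0.0 for b in beta), "of", len(beta))
```

Ordinary least squares has no unique solution when genes outnumber samples. The L1 penalty sets some coefficients exactly to zero, which is one reason such regularized procedures are attractive here. In practice the penalty would be chosen by a model selection criterion or cross-validation rather than fixed as above.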
