Group Variable Selection Methods and Their Applications in Analysis of Genomic Data

1.1 Introduction

Regression is a simple yet highly useful statistical method in data analysis. The goal of regression analysis is to discover the relationship between a response y and a set of predictors x_1, x_2, ..., x_p. When fitting a regression model, parsimony is an important criterion of goodness in addition to prediction accuracy. Researchers prefer simpler models because they make the relationship between x and y easier to interpret. Moreover, discarding irrelevant predictors often improves prediction accuracy [13]. Variable selection methods have long been used in regression analysis, for example forward selection, backward elimination, and best subset regression. In the traditional setting, the number of variables p is typically 10 or at most a few dozen.

Modern scientific technology, led by the microarray, has produced data dramatically above this conventional scale. In gene expression microarray data we have p = 1,000 to 10,000, and in single nucleotide polymorphism (SNP) data p can reach 500,000. To complicate matters further, the large numbers of variables in biological data are dependent. For example, it is well known that genes sharing a common biological function or participating in the same metabolic pathway can have very high pairwise correlations [14]. Traditional variable selection methods that select variables one by one may miss important group effects on pathways. Consequently, when such methods are applied to multiple data sets from a common biological system, the variables selected across the studies may show little overlap.

To overcome these challenges, we have developed a series of group variable selection methods, which gather highly correlated genes into a group and select the whole group once any gene in it enters the model. In this chapter, we introduce the idea of group variable selection and illustrate its utility by applying the methods to genomic data analysis.

[1] V. Sheffield, et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 2006.

[2] G. Wahba, et al. A note on the LASSO and related procedures in model selection. 2006.

[3] Michael Ruogu Zhang, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 1998.

[4] C. Nusbaum, et al. Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science, 1998.

[5] Jing Wu, et al. Statistical methods in integrative analysis for gene regulatory modules. Statistical Applications in Genetics and Molecular Biology, 2011.

[6] M. Yuan, et al. Model selection and estimation in regression with grouped variables. 2006.

[7] R. Tibshirani, et al. Least angle regression. 2004, math/0406456.

[8] Jing Wu, et al. Computation-based discovery of cis-regulatory modules by hidden Markov model. Journal of Computational Biology, 2008.

[9] Charles Elkan, et al. The value of prior knowledge in discovering motifs with MEME. ISMB, 1995.

[10] Yongdai Kim, et al. Smoothly clipped absolute deviation on high dimensions. 2008.

[11] C. Molony, et al. Genetic analysis of genome-wide variation in human gene expression. Nature, 2004.

[12] Wenjiang J. Fu, et al. Asymptotics for lasso-type estimators. 2000.

[13] G. Casella, et al. The Bayesian lasso. 2008.

[14] Nicola J. Rinaldi, et al. Serial regulation of transcriptional regulators in the yeast cell cycle. Cell, 2001.

[15] Jun Xie, et al. Bayesian models and Markov chain Monte Carlo methods for protein motifs with the secondary characteristics. Journal of Computational Biology, 2005.

[16] H. Zou. The adaptive lasso and its oracle properties. 2006.

[17] Alan J. Lee, et al. Linear Regression Analysis, 2nd edition. 2003.

[18] Jun Xie, et al. Protein multiple alignment incorporating primary and secondary structure information. Journal of Computational Biology, 2006.

[19] N. Yi, et al. Bayesian LASSO for quantitative trait loci mapping. Genetics, 2008.

[20] H. Zou, et al. Regularization and variable selection via the elastic net. 2005.

[21] A. E. Hoerl, et al. Ridge regression: applications to nonorthogonal problems. 1970.

[22] X. Huo, et al. When do stepwise algorithms meet subset selection criteria. 2007, 0708.2149.

[23] George A. F. Seber. Linear Regression Analysis. 1977.

[24] R. Tibshirani. Regression shrinkage and selection via the lasso. 1996.

[25] Peng Zhao, et al. On model selection consistency of lasso. Journal of Machine Learning Research, 2006.

[26] Kam D. Dahlquist, et al. Regression approaches for microarray data analysis. Journal of Computational Biology, 2002.

[27] Cun-Hui Zhang, et al. The sparsity and bias of the Lasso selection in high-dimensional linear regression. 2008, 0808.0967.

[28] Jianqing Fan, et al. Variable selection via nonconcave penalized likelihood and its oracle properties. 2001.

[29] I. Johnstone, et al. Ideal spatial adaptation by wavelet shrinkage. 1994.