Variable Selection in Regression Mixture Modeling for the Discovery of Gene Regulatory Networks

The profusion of genomic data from genome sequencing and gene expression microarray technology has facilitated statistical research into the gene interactions that regulate a biological process. Current methods generally consist of a two-stage procedure: clustering gene expression measurements, then searching for regulatory "switches," typically short, conserved sequence patterns (motifs) in the DNA sequence adjacent to the genes. This process often leads to misleading conclusions, because incorrect cluster selection may cause important regulatory motifs to be missed or produce many false discoveries. Treating cluster memberships as known, rather than estimated, introduces bias into the analysis and prevents uncertainty about the cluster parameters from being taken into account. Further, the available data are underused, because the sequence information is ignored when clustering expression and the expression information is ignored when searching sequence. We propose to address these issues by combining gene clustering and motif discovery in a unified framework: a mixture of hierarchical regression models, with unknown components representing the latent gene clusters, and genomic sequence features linked to the resultant gene expression through a multivariate hierarchical regression. We demonstrate a Monte Carlo method for simultaneous variable selection (for motifs) and clustering (for genes). The number of components in the mixture is selected by computing the analytically intractable Bayes factor through a novel multistage mixture importance sampling approach. This methodology is used to analyze a yeast cell cycle dataset to determine an optimal set of motifs that discriminates between groups of genes and simultaneously finds the most significant gene clusters.
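
To make the idea of a regression mixture concrete, the following is a minimal sketch, not the method of the paper. It assumes a plain (non-hierarchical) mixture of Gaussian linear regressions, fits it by EM rather than by the Monte Carlo scheme described above, and performs no variable selection. The function name `fit_regression_mixture`, the simulated "motif score" covariate, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_regression_mixture(X, y, k=2, n_iter=50, seed=0):
    """EM for a k-component mixture of linear regressions (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Random soft memberships to start: one row of k probabilities per gene.
    resp = rng.dirichlet(np.ones(k), size=n)
    betas = np.zeros((k, p))
    sigma2 = np.ones(k)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # M-step: weighted least squares within each latent cluster.
        for j in range(k):
            w = resp[:, j]
            Xw = X * w[:, None]
            betas[j] = np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(p), Xw.T @ y)
            resid_j = y - X @ betas[j]
            sigma2[j] = max((w * resid_j**2).sum() / (w.sum() + 1e-12), 1e-8)
            weights[j] = max(w.mean(), 1e-12)
        # E-step: memberships proportional to weight times Gaussian density.
        resid = y[:, None] - X @ betas.T
        log_dens = (np.log(weights) - 0.5 * np.log(2 * np.pi * sigma2)
                    - 0.5 * resid**2 / sigma2)
        log_dens -= log_dens.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_dens)
        resp /= resp.sum(axis=1, keepdims=True)
    return betas, sigma2, weights, resp

# Simulated data (assumption): one hypothetical motif score per gene, and two
# latent clusters in which the motif has opposite effects on expression.
rng = np.random.default_rng(1)
n = 200
motif_score = rng.normal(size=n)
cluster = rng.integers(0, 2, size=n)
slope = np.where(cluster == 0, 2.0, -2.0)
y = slope * motif_score + rng.normal(scale=0.3, size=n)
X = np.column_stack([np.ones(n), motif_score])  # intercept + motif covariate

betas, sigma2, weights, resp = fit_regression_mixture(X, y, k=2)
print("estimated coefficients per cluster:\n", betas)
print("estimated mixing weights:", weights)
```

Each row of `resp` holds estimated cluster membership probabilities rather than a hard assignment, which is the point of fitting clusters and regression coefficients jointly instead of in two stages. EM can converge to a local optimum, so results depend on the starting values.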
