Mouse obesity network reconstruction with a variational Bayes algorithm to employ aggressive false positive control

Background: We propose a novel variational Bayes network reconstruction algorithm to extract the most relevant disease factors from high-throughput genomic data-sets. Our algorithm is the only scalable method for regularized network recovery that employs Bayesian model averaging and that can internally estimate an appropriate level of sparsity, ensuring that few false positives enter the model without the need for cross-validation or a model selection criterion. We use our algorithm to characterize the effect of genetic markers and liver gene expression traits on mouse obesity-related phenotypes, including weight, cholesterol, glucose, and free fatty acid levels, in an experiment previously used for discovery and validation of network connections: an F2 intercross between the C57BL/6J and C3H/HeJ mouse strains, where apolipoprotein E is null on the background.

Results: We identified eleven genes, Gch1, Zfp69, Dlgap1, Gna14, Yy1, Gabarapl1, Folr2, Fdft1, Cnr2, Slc24a3, and Ccl19, and a quantitative trait locus directly connected to weight, glucose, cholesterol, or free fatty acid levels in our network. None of these genes were identified by other network analyses of this mouse intercross data-set, but all have been previously associated with obesity or related pathologies in independent studies. In addition, through both simulations and data analysis, we demonstrate that our algorithm achieves better power and type I error control than other network recovery algorithms that use the lasso and have bounds on type I error control.

Conclusions: Our final network contains 118 previously associated and novel genes affecting weight, cholesterol, glucose, and free fatty acid levels that are excellent obesity risk candidates.
