Information Enhanced Model Selection for High-Dimensional Gaussian Graphical Model with Application to Metabolomic Data

In light of the low signal-to-noise nature of many large biological data sets, we propose a novel method to identify the structure of association networks using a Gaussian graphical model combined with prior knowledge. Our algorithm has two parts. In the first part, we propose a model selection criterion called the structural Bayesian information criterion (SBIC), in which the prior structure is modeled and incorporated into the Bayesian information criterion (BIC). We show that the popular extended BIC (EBIC) is a special case of SBIC. In the second part, we propose a two-step algorithm to construct the candidate model pool. The algorithm is data-driven, and the prior structure is embedded into the candidate models automatically. Theoretical investigation shows that, under some mild conditions, SBIC is a consistent model selection criterion for the high-dimensional Gaussian graphical model. Simulation studies validate the superiority of SBIC over the standard BIC and show its robustness to model misspecification. Application to relative concentration data from infant feces, collected from subjects enrolled in a large molecular epidemiologic cohort study, validates that prior knowledge of metabolic pathway involvement is a statistically significant factor for conditional dependence among metabolites. More importantly, the proposed algorithm identifies new relationships among metabolites that conventional pathway analysis does not cover. Some of these relationships are already widely recognized in the literature.
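The abstract does not state the SBIC formula, so the following is only a minimal sketch of the EBIC special case it mentions, written in the form commonly used for Gaussian graphical models: -2 l_n(Θ) + |E| log n + 4 |E| γ log p, where Θ is an estimated precision matrix, |E| its number of edges, n the sample size, and p the number of variables. The function names (`log_det_spd`, `trace_product`, `count_edges`, `ebic`) and the default γ = 0.5 are illustrative assumptions, not part of the paper; setting γ = 0 gives the standard BIC.

```python
import math


def log_det_spd(theta):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    p = len(theta)
    chol = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                d = theta[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                chol[i][j] = math.sqrt(d)
            else:
                chol[i][j] = (theta[i][j] - s) / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(p))


def trace_product(a, b):
    """Trace of the matrix product a @ b for square matrices."""
    p = len(a)
    return sum(a[i][k] * b[k][i] for i in range(p) for k in range(p))


def count_edges(theta, tol=1e-8):
    """Number of nonzero off-diagonal entries in the upper triangle."""
    p = len(theta)
    return sum(
        1 for i in range(p) for j in range(i + 1, p) if abs(theta[i][j]) > tol
    )


def ebic(sample_cov, theta, n, gamma=0.5):
    """EBIC of a Gaussian graphical model (illustrative helper, assumed form).

    The log-likelihood is written up to an additive constant:
    l_n(theta) = (n / 2) * (log det(theta) - trace(sample_cov @ theta)).
    """
    p = len(theta)
    loglik = 0.5 * n * (log_det_spd(theta) - trace_product(sample_cov, theta))
    k = count_edges(theta)
    return -2.0 * loglik + k * math.log(n) + 4.0 * k * gamma * math.log(p)
```

In use, one would evaluate `ebic` on each candidate precision matrix in the model pool and keep the candidate with the smallest value; how SBIC changes the penalty to reflect the prior structure is specified in the paper itself and is not reproduced here.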
