Learning mixed graphical models with separate sparsity parameters and stability-based model selection

Background: Mixed graphical models (MGMs) are graphical models learned over a combination of continuous and discrete variables. Mixed variable types are common in biomedical datasets. MGMs consist of a parameterized joint probability density, which implies a network structure over these heterogeneous variables. The network structure reveals direct associations between the variables, and the joint probability density allows one to ask arbitrary probabilistic questions on the data. This information can be used for feature selection, classification and other important tasks.

Results: We studied the properties of MGM learning and applications of MGMs to high-dimensional data (biological and simulated). Our results show that MGMs reliably uncover the underlying graph structure, and when used for classification, their performance is comparable to popular discriminative methods (lasso regression and support vector machines). We also show that imposing separate sparsity penalties for edges connecting different types of variables significantly improves edge recovery performance. To choose these sparsity parameters, we propose a new efficient model selection method, named Stable Edge-specific Penalty Selection (StEPS). StEPS is an expansion of an earlier method, StARS, to mixed variable types. In terms of edge recovery, StEPS-selected MGMs outperform models selected using standard techniques, including AIC, BIC and cross-validation. In addition, we use a heuristic search that is linear in the size of the sparsity value search space, as opposed to the cubic grid search required by other model selection methods. We applied our method to clinical and mRNA expression data from the Lung Genomics Research Consortium (LGRC), and the learned MGM correctly recovered connections between the diagnosis of obstructive or interstitial lung disease, two diagnostic breathing tests, and cigarette smoking history. Our model also suggested biologically relevant mRNA markers that are linked to these three clinical variables.

Conclusions: MGMs are able to accurately recover dependencies between sets of continuous and discrete variables in both simulated and biomedical datasets. Separation of sparsity penalties by edge type is essential for accurate network edge recovery. Furthermore, our stability-based method for model selection determines sparsity parameters faster and more accurately (in terms of edge recovery) than other model selection methods. With the ongoing availability of comprehensive clinical and biomedical datasets, MGMs are expected to become a valuable tool for investigating disease mechanisms and answering an array of critical healthcare questions.
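As a rough illustration of the stability-based, edge-type-specific selection summarized in the Results, the sketch below computes a StARS-style instability score separately for continuous-continuous, continuous-discrete and discrete-discrete edges, then picks one penalty per edge type. It is not the authors' implementation: the function names, the 0.05 threshold (borrowed from the usual StARS default) and the assumption that graphs were already learned on subsamples at each candidate penalty are all hypothetical choices made for this example.

```python
import numpy as np


def edge_type_masks(is_discrete):
    """Boolean (p, p) masks for the three edge types (hypothetical helper)."""
    d = np.asarray(is_discrete, dtype=bool)
    c = ~d
    return {
        "cc": c[:, None] & c[None, :],
        "cd": (c[:, None] & d[None, :]) | (d[:, None] & c[None, :]),
        "dd": d[:, None] & d[None, :],
    }


def edge_instability(adjacencies, mask):
    """StARS-style instability averaged over the edges selected by `mask`.

    adjacencies: (n_subsamples, p, p) symmetric 0/1 arrays, one per subsample.
    mask: boolean (p, p) array marking the edges of one type.
    """
    adjacencies = np.asarray(adjacencies, dtype=float)
    theta = adjacencies.mean(axis=0)   # how often each edge was selected
    xi = 2.0 * theta * (1.0 - theta)   # per-edge disagreement across subsamples
    upper = np.triu(np.asarray(mask, dtype=bool), k=1)  # each undirected edge once
    return float(xi[upper].mean()) if upper.any() else 0.0


def select_penalties(lambdas, instability_by_type, threshold=0.05):
    """For each edge type, walk from the sparsest setting (largest lambda)
    toward denser graphs and keep the last lambda whose instability is
    still at or below `threshold` (assumed value, not from the source)."""
    lambdas = np.asarray(lambdas, dtype=float)
    order = np.argsort(lambdas)[::-1]
    chosen = {}
    for edge_type, values in instability_by_type.items():
        pick = lambdas[order[0]]
        for i in order:
            if values[i] > threshold:
                break
            pick = lambdas[i]
        chosen[edge_type] = float(pick)
    return chosen
```

Under the assumption that all three penalties share one candidate path of k values, this walk evaluates k settings, whereas an exhaustive grid over three penalties would need k**3 model fits; that is one way to read the linear versus cubic contrast drawn in the Results.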
