Graphical modeling of binary data using the LASSO: a simulation study

Background

Graphical models were identified as a promising new approach to modeling high-dimensional clinical data. They provided a probabilistic tool to display, analyze and visualize net-like dependence structures by drawing a graph that describes the conditional dependencies between the variables. Until now, research has focused mainly on building Gaussian graphical models for continuous data following a multivariate normal distribution. Satisfactory solutions for binary data were missing. We adapted the method of Meinshausen and Bühlmann to binary data and used the LASSO for logistic regression. The objective of this paper was to examine the performance of the Bolasso in the development of graphical models for high-dimensional binary data. We hypothesized that the Bolasso is superior to competing LASSO methods in identifying graphical models.

Methods

We analyzed the Bolasso for deriving graphical models in comparison with other LASSO-based methods. Model performance was assessed in a simulation study with random data generated via symmetric local logistic regression models and Gibbs sampling. The main outcome variables were the Structural Hamming Distance and the Youden Index. We applied the results of the simulation study to a real-life data set containing functioning data of patients with head and neck cancer.

Results

Bootstrap aggregating as incorporated in the Bolasso algorithm greatly improved performance at larger sample sizes. The number of bootstraps had minimal impact on performance. The Bolasso performed reasonably well with a cutpoint of 0.90 and a small penalty term. Optimal prediction for the Bolasso led to very conservative models in comparison with AIC, BIC or cross-validated optimal penalty terms.

Conclusions

Bootstrap aggregating may improve variable selection if the underlying selection process is not too unstable due to small sample size and if one is mainly interested in reducing the false discovery rate. We propose using the Bolasso for graphical modeling in large sample sizes.
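As an illustration of the quantities named above, the following is a minimal sketch, not the authors' implementation. It assumes that each variable has already been regressed on all others with an L1-penalized logistic regression over many bootstrap samples, and that `freq[i][j]` holds the fraction of bootstrap fits in which variable `j` was selected as a predictor of variable `i`. The function names, the toy numbers and the choice between an AND and an OR rule for combining the two directions are assumptions made for this example. The Structural Hamming Distance is computed here for undirected graphs, as the number of edges that must be added or removed to turn the estimated graph into the true one.

```python
from itertools import combinations


def edges_from_frequencies(freq, cutpoint=0.90, rule="and"):
    """Turn bootstrap selection frequencies into a set of undirected edges.

    freq[i][j] is the (hypothetical) share of bootstrap fits in which
    variable j was selected in the regression of variable i.
    """
    p = len(freq)
    edges = set()
    for i, j in combinations(range(p), 2):
        i_picks_j = freq[i][j] >= cutpoint
        j_picks_i = freq[j][i] >= cutpoint
        if rule == "and":
            keep = i_picks_j and j_picks_i
        else:  # "or" rule: one direction is enough
            keep = i_picks_j or j_picks_i
        if keep:
            edges.add((i, j))
    return edges


def structural_hamming_distance(estimated, true):
    """Number of edge additions/deletions separating two undirected graphs."""
    return len(estimated ^ true)


def youden_index(estimated, true, p):
    """Sensitivity + specificity - 1 over all p*(p-1)/2 possible edges.

    Undefined (division by zero) if the true graph is empty or complete.
    """
    all_pairs = set(combinations(range(p), 2))
    tp = len(estimated & true)
    fn = len(true - estimated)
    fp = len(estimated - true)
    tn = len(all_pairs - estimated - true)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


# Toy example with 4 binary variables and a chain 0 - 1 - 2 - 3 as the truth.
true_edges = {(0, 1), (1, 2), (2, 3)}
freq = [
    [0.00, 0.95, 0.10, 0.00],
    [0.97, 0.00, 0.92, 0.20],
    [0.05, 0.85, 0.00, 0.99],
    [0.00, 0.93, 0.96, 0.00],
]

est_and = edges_from_frequencies(freq, cutpoint=0.90, rule="and")
est_or = edges_from_frequencies(freq, cutpoint=0.90, rule="or")

shd_and = structural_hamming_distance(est_and, true_edges)
shd_or = structural_hamming_distance(est_or, true_edges)
youden_and = youden_index(est_and, true_edges, p=4)
youden_or = youden_index(est_or, true_edges, p=4)

print(sorted(est_and), shd_and, youden_and)
print(sorted(est_or), shd_or, youden_or)
```

In this toy example the AND rule drops the true edge between variables 1 and 2, because only one direction reaches the cutpoint, while the OR rule keeps it but admits a spurious edge between variables 1 and 3. This mirrors the trade-off between conservative models and false discoveries discussed in the Results.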
