Variational Inference for sparse network reconstruction from count data

In multivariate statistics, the question of finding direct interactions can be formulated as a problem of network inference (or network reconstruction), for which the Gaussian graphical model (GGM) provides a canonical framework. Unfortunately, the Gaussian assumption does not apply to count data, which are encountered in domains such as genomics, social sciences or ecology. To circumvent this limitation, state-of-the-art approaches use two-step strategies that first transform counts into pseudo-Gaussian observations and then apply a (partial) correlation-based approach from the abundant literature on GGM inference. We adopt a different stance by relying on a latent model in which counts are modeled directly by Poisson distributions that are conditional on correlated latent (hidden) Gaussian variables. In this multivariate Poisson lognormal model, the dependency structure is completely captured by the latent layer. This parametric model makes it possible to account for the effects of covariates on the counts. To perform network inference, we add a sparsity-inducing constraint on the inverse covariance matrix of the latent Gaussian vector. Unlike in the usual Gaussian setting, the penalized likelihood is generally not tractable, so we resort instead to a variational approach for approximate likelihood maximization. The corresponding optimization problem is solved by alternating a gradient ascent on the variational parameters and a graphical-Lasso step on the covariance matrix. We show that our approach is highly competitive with existing methods on simulations inspired by microbiological data. We then illustrate on three different data sets how accounting for sampling effort via offsets and integrating external covariates (which is rarely done in the existing literature) drastically changes the topology of the inferred network.
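To make the generative model described above concrete, here is a minimal simulation sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `simulate_pln`, the chain-shaped precision matrix, the regression coefficients and the offset values are all hypothetical choices. It only draws data from a Poisson lognormal model with a sparse latent precision matrix; it does not perform the variational inference or the graphical-Lasso step.

```python
import numpy as np

def simulate_pln(Omega, B, X, offsets, rng):
    """Draw counts from a Poisson lognormal model (illustrative sketch).

    Latent layer:  Z_i ~ N(0, Omega^{-1}), with Omega a sparse precision matrix.
    Count layer:   Y_ij | Z_ij ~ Poisson(exp(o_ij + x_i' B_j + Z_ij)).
    """
    n, p = X.shape[0], Omega.shape[0]
    Sigma = np.linalg.inv(Omega)
    # Sample correlated Gaussians through the Cholesky factor of Sigma.
    chol = np.linalg.cholesky(Sigma)
    Z = rng.standard_normal((n, p)) @ chol.T
    # Offsets (sampling effort) and covariates enter on the log scale.
    log_mean = offsets + X @ B + Z
    Y = rng.poisson(np.exp(log_mean))
    return Y, Z

rng = np.random.default_rng(0)
n, p, d = 200, 5, 2

# Hypothetical sparse network: a chain graph, i.e. a tridiagonal precision matrix.
Omega = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))

# Hypothetical design: an intercept plus one external covariate.
X = np.column_stack([np.ones(n), rng.normal(size=n)])
B = rng.normal(scale=0.3, size=(d, p))

# Hypothetical sampling effort per sample, shared across the p variables.
offsets = np.log(rng.integers(50, 200, size=(n, 1)))

Y, Z = simulate_pln(Omega, B, X, offsets, rng)
print(Y.shape)
```

In this sketch the zeros of `Omega` encode the missing edges of the network. The counts `Y` depend on each other only through the latent layer `Z`, which is why the sparsity constraint is placed on the latent precision matrix rather than on the counts themselves.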
