Accelerating Bayesian inference of dependency between complex biological traits

Inferring dependencies between complex biological traits while accounting for evolutionary relationships between specimens is of great scientific interest, yet it remains infeasible when trait and specimen counts grow large. The state-of-the-art approach uses a phylogenetic multivariate probit model to accommodate binary and continuous traits via a latent variable framework, and uses an efficient bouncy particle sampler (BPS) to tackle the computational bottleneck: integrating many latent variables from a high-dimensional truncated normal distribution. This approach breaks down as the number of specimens grows and fails to reliably characterize conditional dependencies between traits. Here, we propose an inference pipeline for phylogenetic probit models that greatly outperforms BPS. The novelty lies in 1) combining the recent Zigzag Hamiltonian Monte Carlo (Zigzag-HMC) with linear-time gradient evaluations and 2) a joint sampling scheme for highly correlated latent variables and correlation matrix elements. In an application exploring HIV-1 evolution from 535 viruses, the inference requires joint sampling from an 11,235-dimensional truncated normal and a 24-dimensional covariance matrix. Our method yields a 5-fold speedup compared to BPS and makes it possible to learn partial correlations between candidate viral mutations and virulence. This computational speedup enables us to tackle even larger problems: we study the evolution of influenza H1N1 glycosylations on around 900 viruses. For broader applicability, we extend the phylogenetic probit model to incorporate categorical traits and demonstrate its use in studying Aquilegia flower and pollinator co-evolution.
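To make the notion of conditional dependence concrete, the sketch below shows the standard conversion from a covariance (or correlation) matrix to partial correlations through its inverse, the precision matrix. It is an illustration only, not the paper's inference pipeline: the helper names `invert` and `partial_correlations` and the three-trait toy matrix are assumptions introduced here, and in practice the conversion would be applied to posterior samples of the trait covariance matrix.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Build the augmented matrix [A | I].
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    # The right half of the augmented matrix is now the inverse.
    return [row[n:] for row in aug]


def partial_correlations(cov):
    """Partial correlation of traits i and j given all others: -P_ij / sqrt(P_ii * P_jj)."""
    prec = invert(cov)
    n = len(prec)
    return [
        [
            1.0 if i == j else -prec[i][j] / math.sqrt(prec[i][i] * prec[j][j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical toy example: traits A, B, C where A and C are related only through B.
corr = [
    [1.00, 0.50, 0.25],
    [0.50, 1.00, 0.50],
    [0.25, 0.50, 1.00],
]
pcor = partial_correlations(corr)
```

In this toy matrix, A and C have a marginal correlation of 0.25, but their partial correlation given B is zero up to floating-point error. This is the distinction between marginal and conditional dependence that partial correlations are meant to capture.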
