Principled, practical, flexible, fast: A new approach to phylogenetic factor analysis

Biological phenotypes are products of complex evolutionary processes in which selective forces influence multiple biological trait measurements in unknown ways. Phylogenetic factor analysis disentangles these relationships across the evolutionary history of a group of organisms. Scientists seeking to employ this modeling framework confront numerous modeling and implementation decisions, the details of which pose computational and replicability challenges. General and impactful community employment requires a data-scientific analysis plan that balances flexibility, speed and ease of use, while minimizing model and algorithm tuning. Even in the presence of non-trivial phylogenetic model constraints, we show that one may analytically address latent factor uncertainty in a way that (a) aids model flexibility, (b) accelerates computation (by as much as 500-fold) and (c) decreases required tuning. We further present practical guidance on inference and modeling decisions as well as on diagnosing and solving common problems in these analyses. We codify this analysis plan in an automated pipeline that distills the potentially overwhelming array of modeling decisions into a handful of (typically binary) choices. We demonstrate the utility of these methods and this analysis plan in four real-world problems of varying scales.
