Mapping the Shapes of Phylogenetic Trees from Human and Zoonotic RNA Viruses

A phylogeny is a tree-based model of common ancestry that is an indispensable tool for studying biological variation. Phylogenies play a special role in the study of rapidly evolving populations such as viruses, where the proliferation of lineages is constantly being shaped by the mode of virus transmission, by adaptation to immune systems, and by patterns of human migration and contact. These processes may leave an imprint on the shapes of virus phylogenies that can be extracted for comparative study; however, tree shapes are intrinsically difficult to quantify. Here we present a comprehensive study of phylogenies reconstructed from 38 different RNA viruses from 12 taxonomic families that are associated with human pathologies. To accomplish this, we have developed a new procedure for studying phylogenetic tree shapes based on the ‘kernel trick’, a technique that maps complex objects into a statistically convenient space. We show that our kernel method outperforms nine different tree balance statistics at correctly classifying phylogenies that were simulated under different evolutionary scenarios. Using the kernel method, we observe patterns in the distribution of RNA virus phylogenies in this space that reflect modes of transmission and pathogenesis. For example, viruses that can establish persistent chronic infections (such as HIV and hepatitis C virus) form a distinct cluster. Although the visibly ‘star-like’ shape characteristic of trees from these viruses has been well-documented, we show that established methods for quantifying tree shape fail to distinguish these trees from those of other viruses. The kernel approach presented here potentially represents an important new tool for characterizing the evolution and epidemiology of RNA viruses.
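To make the contrast between tree balance statistics and a kernel comparison concrete, the following is a minimal sketch and not the authors' implementation. It computes two classical balance statistics (the Sackin and Colless indices) and a simplified, topology-only tree kernel in the spirit of convolution kernels from natural language processing. The nested-tuple tree encoding, the function names, the `decay` parameter, and the omission of branch lengths are all assumptions made for illustration; the kernel described in the paper may differ in each of these respects.

```python
from itertools import product

# Hypothetical toy encoding: a tip is the string "tip"; an internal node is a
# 2-tuple of child subtrees. Branch lengths are ignored in this sketch.
TIP = "tip"


def is_tip(node):
    return node == TIP


def n_tips(node):
    """Number of tips descending from a node."""
    return 1 if is_tip(node) else sum(n_tips(child) for child in node)


def sackin(node, depth=0):
    """Sackin index: sum of the depths of all tips."""
    if is_tip(node):
        return depth
    return sum(sackin(child, depth + 1) for child in node)


def colless(node):
    """Colless index: sum over internal nodes of |tips(left) - tips(right)|."""
    if is_tip(node):
        return 0
    left, right = node
    return abs(n_tips(left) - n_tips(right)) + colless(left) + colless(right)


def internal_nodes(node):
    """Yield every internal node of the tree."""
    if is_tip(node):
        return
    yield node
    for child in node:
        yield from internal_nodes(child)


def ordered(node):
    """Put the smaller child first so that left/right order is ignored.
    Ties between equally sized subtrees keep their given order (a simplification)."""
    left, right = node
    return (left, right) if n_tips(left) <= n_tips(right) else (right, left)


def production(node):
    """Local shape of a node: which of its (ordered) children are tips."""
    return tuple(is_tip(child) for child in ordered(node))


def match(a, b, decay):
    """Weighted count of shared subset trees rooted at nodes a and b."""
    if is_tip(a) or is_tip(b) or production(a) != production(b):
        return 0.0
    score = decay
    for child_a, child_b in zip(ordered(a), ordered(b)):
        score *= 1.0 + match(child_a, child_b, decay)
    return score


def tree_kernel(t1, t2, decay=0.5):
    """Sum the match score over all pairs of internal nodes (assumed form)."""
    pairs = product(internal_nodes(t1), internal_nodes(t2))
    return sum(match(a, b, decay) for a, b in pairs)


def normalized_kernel(t1, t2, decay=0.5):
    """Kernel score scaled so that any tree compared with itself scores 1."""
    self_scores = tree_kernel(t1, t1, decay) * tree_kernel(t2, t2, decay)
    return tree_kernel(t1, t2, decay) / self_scores ** 0.5


# Two four-tip shapes: perfectly balanced versus fully unbalanced (ladder-like).
BALANCED = ((TIP, TIP), (TIP, TIP))
LADDER = (TIP, (TIP, (TIP, TIP)))

if __name__ == "__main__":
    for name, tree in (("balanced", BALANCED), ("ladder", LADDER)):
        print(name, "Sackin:", sackin(tree), "Colless:", colless(tree))
    print("normalized kernel:", normalized_kernel(BALANCED, LADDER))
```

Each balance statistic reduces a tree to a single number, whereas the kernel yields a similarity score for every pair of trees. A matrix of such pairwise scores is what kernel-based classifiers and projections operate on.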
