Phylogenetic Analysis of SARS-CoV-2 Data Is Difficult

Numerous studies covering aspects of SARS-CoV-2 data analysis are published daily, alongside a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies, using as an example a data snapshot comprising all virus sequences available from gisaid.org on May 5, 2020. We find that inferring a reliable phylogeny from these data is difficult because the number of sequences is large while the number of mutations is low. We further find that it does not appear possible to root the inferred phylogeny with some degree of confidence, either via the bat and pangolin outgroups or by applying novel computational methods to the ingroup phylogeny. Finally, an automatic classification of the current sequences into sub-classes based on statistical criteria is also not possible, as the sequences are too closely related. We conclude that, although applying phylogenetic methods to disentangle the evolution and spread of COVID-19 provides some insight, the results of phylogenetic analyses, in particular those conducted under the default settings of current phylogenetic inference tools, and of downstream analyses on the inferred phylogenies should be interpreted with extreme caution.
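
The tension between many sequences and few mutations can be illustrated with a minimal sketch. It is not the analysis pipeline used in the study: the function names and the toy alignment are hypothetical and chosen only for illustration. The sketch counts variable alignment columns and compares that count with the n - 3 internal branches that a fully resolved unrooted tree on n sequences must have. When the number of variable sites is small relative to the number of internal branches, many branches cannot be supported by any substitution.

```python
# Illustrative sketch only: the helper names and the toy alignment are
# assumptions made for this example, not part of the original study.

def variable_sites(alignment):
    """Count columns with more than one distinct unambiguous nucleotide.

    Gaps and ambiguity codes (anything outside A, C, G, T) are ignored.
    """
    count = 0
    for column in zip(*alignment):
        states = {c.upper() for c in column if c.upper() in "ACGT"}
        if len(states) > 1:
            count += 1
    return count


def internal_branches(n_tips):
    """Internal branches of a fully resolved unrooted tree with n_tips tips."""
    return max(n_tips - 3, 0)


# Hypothetical toy alignment: six sequences of length ten.
toy_alignment = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGAAC",
    "ACGTACGAAC",
    "ACGNACGTAC",
    "ACG-ACGTAC",
]

n_variable = variable_sites(toy_alignment)
n_internal = internal_branches(len(toy_alignment))
print(f"variable sites: {n_variable}, internal branches to resolve: {n_internal}")
```

In this toy case there are fewer variable sites than internal branches, so at most some of the branches can be backed by a substitution. The same comparison on a real alignment would only be a rough indicator, since it ignores recurrent mutations and sequencing errors.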
