Ultrafast Sample Placement on Existing Trees (UShER) Empowers Real-Time Phylogenetics for the SARS-CoV-2 Pandemic

As the SARS-CoV-2 virus spreads through human populations, the unprecedented accumulation of viral genome sequences is ushering in a new era of “genomic contact tracing”, that is, using viral genome sequences to trace local transmission dynamics. However, because the viral phylogeny is already so large, and will undoubtedly grow manyfold, placing new sequences onto the tree has emerged as a barrier to real-time genomic contact tracing. Here, we resolve this challenge by building an efficient, tree-based data structure encoding the inferred evolutionary history of the virus. We demonstrate that our approach improves the speed of phylogenetic placement of new samples and of data visualization by orders of magnitude, making it possible to complete the placements under real-time constraints. Our method also provides the key ingredient for maintaining a fully updated reference phylogeny. We make these tools available to the research community through the UCSC SARS-CoV-2 Genome Browser to enable rapid cross-referencing of information in new virus sequences with an ever-expanding array of molecular and structural biology data. The methods described here will empower research and genomic contact tracing for laboratories worldwide.

Software Availability: UShER is available to users through the UCSC Genome Browser at https://genome.ucsc.edu/cgi-bin/hgPhyloPlace. The source code and detailed instructions on how to compile and run UShER are available from https://github.com/yatisht/usher.
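To make the idea of placing a sample on a tree that encodes inferred mutations more concrete, the following is a minimal illustrative sketch, not UShER's implementation. It assumes each branch stores the mutations inferred on it, and it scores a new sample by how many extra mutations would be needed to attach it as a child of each existing node. The names `Node` and `place`, the toy tree, and the sample are all hypothetical, and the sketch ignores refinements such as attaching partway along a branch or breaking ties.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: mutations maps position -> derived base
    inferred on the branch leading to this node."""
    name: str
    mutations: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def place(root, sample):
    """Return (node_name, cost) for the cheapest attachment point.

    sample maps position -> base for sites that differ from the reference.
    cost is the number of sites where the sample differs from the node's
    accumulated genotype, i.e. the mutations the new branch would carry.
    """
    best = (None, float("inf"))
    stack = [(root, {})]  # (node, genotype of its parent)
    while stack:
        node, parent_genotype = stack.pop()
        # Genotype at this node: parent's differences plus this branch's.
        genotype = {**parent_genotype, **node.mutations}
        cost = sum(
            1
            for pos in genotype.keys() | sample.keys()
            if genotype.get(pos) != sample.get(pos)
        )
        if cost < best[1]:
            best = (node.name, cost)
        for child in node.children:
            stack.append((child, genotype))
    return best


# Toy tree (assumed for illustration only).
root = Node("root", {}, [
    Node("A", {241: "T"}, [
        Node("B", {3037: "T"}),
        Node("C", {23403: "G"}),
    ]),
    Node("D", {8782: "T"}),
])

sample = {241: "T", 23403: "G", 14408: "T"}
print(place(root, sample))
```

Because each node's genotype is rebuilt from its parent's during a single traversal, the whole tree is scored in one pass without re-reading any sequence alignment, which is the kind of saving the tree-based encoding is meant to provide.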
