Petabase-scale sequence alignment catalyses viral discovery

Public sequence data represents a major opportunity for viral discovery, but its exploration has been inhibited by a lack of efficient methods for searching this corpus, which is currently at the petabase scale and growing exponentially. To address the ongoing pandemic caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) and to expand the known sequence diversity of viruses, we aligned pangenomes for coronaviruses (CoV) and other viral families to 5.6 petabases of public sequencing data from 3.8 million biologically diverse samples. To implement this strategy, we developed a cloud computing architecture, Serratus, tailored for ultra-high-throughput sequence alignment at the petabase scale. From this search, we identified and assembled thousands of CoV and CoV-like genomes and genome fragments, ranging from known strains to putatively novel genera. We generalised this strategy to other viral families, identifying several novel deltaviruses and huge bacteriophages. To catalyse a new era of viral discovery, we made millions of viral alignments and family identifications freely available to the research community. Expanding the known diversity and zoonotic reservoirs of CoV and other emerging pathogens can accelerate vaccine and therapeutic development for the current pandemic, and help us anticipate and mitigate future ones.
