Comprehensive discovery of CRISPR-targeted terminally redundant sequences in the human gut metagenome: Viruses, plasmids, and more

Viruses are the most numerous biological entities, existing in all environments and infecting all cellular organisms. Compared with cellular life, the origin and evolution of viruses are poorly understood; viruses are enormously diverse, and most lack sequence similarity to cellular genes. To uncover viral sequences without relying on either reference viral sequences from databases or marker genes that characterize specific viral taxa, we developed an analysis pipeline for virus inference based on clustered regularly interspaced short palindromic repeats (CRISPR). CRISPR is a prokaryotic nucleic acid restriction system that stores a memory of previous exposure to foreign genetic elements. Using unassembled short-read metagenomic sequencing data, our protocol can infer CRISPR-targeted sequences, including viruses, plasmids, and previously uncharacterized elements, and predict their hosts. By analyzing human gut metagenomic data, we extracted 11,391 terminally redundant CRISPR-targeted sequences, which are likely complete circular genomes. The sequences included 2,154 tailed-phage genomes, together with 257 complete crAssphage genomes, 11 genomes larger than 200 kilobases, 766 genomes of Microviridae species, 56 genomes of Inoviridae species, and 95 previously uncharacterized small circular genomes that have no reliably predicted protein-coding genes. We predicted the host(s) of approximately 70% of the discovered genomes at the phylum level by linking protospacers to taxonomically assigned CRISPR direct repeats. These results demonstrate that our protocol is efficient for de novo inference of CRISPR-targeted sequences and for prediction of their hosts.
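The notion of terminal redundancy used above can be illustrated with a minimal sketch. When a circular genome is assembled as a linear contig, the start of the contig tends to be repeated at its end, so a shared prefix and suffix suggests a complete circular sequence. The code below is an illustration only and not the pipeline's implementation: the function names, the exact-match criterion, and the `min_overlap` threshold are assumptions made for this example.

```python
def terminal_redundancy_length(seq: str, min_overlap: int = 10) -> int:
    """Return the length of the longest prefix of `seq` that is also its suffix.

    Only exact matches of at least `min_overlap` bases (a hypothetical
    threshold) and at most half the contig length are considered.
    Returns 0 when no such overlap exists.
    """
    seq = seq.upper()
    max_len = len(seq) // 2
    # Try the longest candidate overlap first and stop at the first match.
    for k in range(max_len, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def trim_to_circular(seq: str, min_overlap: int = 10):
    """Remove the repeated end of a terminally redundant contig.

    Returns one copy of the putative circular genome, or None when the
    contig shows no terminal redundancy under the assumed criterion.
    """
    k = terminal_redundancy_length(seq, min_overlap)
    return seq[:-k] if k else None


# Toy example: a 20-base "genome" whose first 12 bases recur at the contig end.
genome = "ATGCGTACGTTAGCCTAGGA"
contig = genome + genome[:12]
overlap = terminal_redundancy_length(contig)
circular = trim_to_circular(contig)
```

A real analysis would likely tolerate mismatches in the overlap and use a longer minimum length; the exact-match check here is chosen only to keep the sketch short.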
