A computational toolset for rapid identification of SARS-CoV-2, other viruses and microorganisms from sequencing data

In this paper, we present a toolset and related resources for rapid identification of viruses and microorganisms from short-read or long-read sequencing data. We present fastv, an ultra-fast tool that detects microbial sequences present in sequencing data, identifies target microorganisms, and visualizes the coverage of microbial genomes. The tool is based on a k-mer mapping and extension method. The k-mer sets it uses are generated by UniqueKMER, another tool provided in this toolset, which can generate the complete set of unique k-mers for each genome within a large collection of viral or microbial genomes. For convenience, unique k-mers for microorganisms and common viruses that afflict humans have been pre-generated and are provided with the tools. As a lightweight tool, fastv accepts FASTQ data as input and directly outputs its results in both HTML and JSON formats. Prior to the k-mer analysis, fastv automatically performs adapter trimming, quality pruning, base correction, and other pre-processing steps to ensure the accuracy of the k-mer analysis. In particular, fastv provides built-in support for rapid SARS-CoV-2 identification and typing. Experimental results showed that fastv achieved 100% sensitivity and 100% specificity in detecting SARS-CoV-2 from sequencing data, and that it could distinguish SARS-CoV-2 from SARS, MERS, and other coronaviruses. This toolset is available at: https://github.com/OpenGene/fastv.
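
To make the unique k-mer idea concrete, the following Python sketch keeps, for each genome in a small collection, only the k-mers that occur in no other genome, and then counts how many k-mers from a set of reads hit each genome's unique set. It is an illustration under simplifying assumptions, not the implementation used by fastv or UniqueKMER: the function names (`kmers`, `unique_kmers`, `count_hits`), the toy sequences, and the choice of k = 5 are all hypothetical, and the sketch ignores reverse complements, sequencing errors, pre-processing, and the extension step mentioned above.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-length substring of seq, upper-cased."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def unique_kmers(genomes, k):
    """Return {genome name: set of k-mers found in that genome only}.

    `genomes` maps a (hypothetical) genome name to its sequence string.
    """
    # Record which genomes contain each k-mer.
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for km in kmers(seq, k):
            owners[km].add(name)

    # Keep a k-mer only if exactly one genome contains it.
    unique = {name: set() for name in genomes}
    for km, names in owners.items():
        if len(names) == 1:
            unique[next(iter(names))].add(km)
    return unique


def count_hits(reads, unique, k):
    """Count, per genome, how many read k-mers match its unique k-mer set."""
    # Each unique k-mer belongs to exactly one genome, so a flat index works.
    index = {km: name for name, kms in unique.items() for km in kms}
    hits = {name: 0 for name in unique}
    for read in reads:
        for km in kmers(read, k):
            name = index.get(km)
            if name is not None:
                hits[name] += 1
    return hits


if __name__ == "__main__":
    # Toy sequences for illustration only; not real viral genomes.
    genomes = {
        "virus_A": "ACGTACGGTTCA",
        "virus_B": "ACGTACCCTTGA",
    }
    k = 5
    unique = unique_kmers(genomes, k)
    reads = ["TACGGTTC", "CGTACCCT"]
    print(count_hits(reads, unique, k))
```

In this toy setup the k-mers shared by both sequences (`ACGTA` and `CGTAC`) are discarded, so a read contributes evidence for a genome only through k-mers that are specific to it. A practical tool would additionally need to handle both strands and much larger genome collections than an in-memory dictionary can comfortably hold.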