A Reference Viral Database (RVDB) To Enhance Bioinformatics Analysis of High-Throughput Sequencing for Novel Virus Detection

ABSTRACT Detection of distantly related viruses by high-throughput sequencing (HTS) is bioinformatically challenging because there is no public database that contains all viral sequences without also containing abundant nonviral sequences, which can extend runtime and obscure viral hits. Our reference viral database (RVDB) includes all viral, virus-related, and virus-like nucleotide sequences (excluding bacterial viruses), regardless of length, with an overall reduction in cellular sequences. Semantic selection criteria (SEM-I) were used to select viral sequences from GenBank, resulting in a first-generation viral database (VDB). This database was manually and computationally reviewed, resulting in refined semantic selection criteria (SEM-R), which were applied to a new download of updated GenBank sequences to create a second-generation VDB. Viral entries in the latter were clustered at 98% by CD-HIT-EST to reduce redundancy while retaining high viral sequence diversity. The viral identity of the clustered representative sequences (creps) was confirmed by BLAST searches in NCBI databases and HMMER searches in the Pfam and Dfam databases. The resulting RVDB contained a broad representation of viral families, high sequence diversity, and reduced cellular content; it includes full-length and partial sequences as well as endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Testing of RVDBv10.2 with an in-house HTS transcriptomic data set indicated a significantly faster run for virus detection than interrogating the entirety of the NCBI nonredundant nucleotide database, which contains all viral sequences but also nonviral sequences. RVDB is publicly available to facilitate HTS analysis, particularly for novel virus detection. It is meant to be updated on a regular basis to include new viral sequences added to GenBank.

IMPORTANCE To facilitate bioinformatics analysis of high-throughput sequencing (HTS) data for the detection of both known and novel viruses, we have developed a new reference viral database (RVDB) that provides a broad representation of different virus species from eukaryotes by including all viral, virus-like, and virus-related sequences (excluding bacteriophages), regardless of their size. In particular, RVDB contains endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Sequences were clustered to reduce redundancy while retaining high viral sequence diversity. A particularly useful feature of RVDB is the reduction of cellular sequences, which can enhance the run efficiency of large transcriptomic and genomic data analysis and increase the specificity of virus detection.
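
The semantic selection step described in the abstract can be pictured as keyword screening of sequence description lines. The following is only a minimal sketch of that idea, not the SEM-I or SEM-R criteria themselves, which are not given here: the keyword lists `POSITIVE` and `NEGATIVE` and the helper names `parse_fasta`, `is_viral`, and `select_viral` are all hypothetical, chosen for illustration.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical keyword lists (assumptions for illustration only; the actual
# SEM-I / SEM-R criteria are not reproduced in this text).
POSITIVE = ("virus", "viral", "retrotransposon", "endogenous retro")
NEGATIVE = ("phage",)  # the abstract states that bacterial viruses are excluded


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_viral(header: str) -> bool:
    """Keep a record if its header matches a positive keyword and no negative one."""
    text = header.lower()
    if any(word in text for word in NEGATIVE):
        return False
    return any(word in text for word in POSITIVE)


def select_viral(records: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Filter (header, sequence) records down to those judged viral by keyword."""
    for header, seq in records:
        if is_viral(header):
            yield header, seq
```

Calling `select_viral(parse_fasta(open_file))` on a FASTA download would yield the records retained by the keyword screen. Sequence length is deliberately not used as a filter, in keeping with the statement that sequences are included regardless of length. Records kept this way would still need the clustering and BLAST/HMMER confirmation steps that the abstract describes, since keyword matching alone can admit cellular sequences with virus-related annotations.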
