VADR: validation and annotation of virus sequence submissions to GenBank

Background: GenBank contains over 3 million viral sequences. The National Center for Biotechnology Information (NCBI) previously made available a tool for validating and annotating influenza virus sequences that is used to check submissions to GenBank. Before this project, there was no analogous tool in use for non-influenza viral sequence submissions.

Results: We developed a system called VADR (Viral Annotation DefineR) that validates and annotates viral sequences in GenBank submissions. The annotation system is based on the analysis of the input nucleotide sequence using models built from curated RefSeqs. Hidden Markov models are used to classify sequences by determining the RefSeq they are most similar to, and feature annotation from the RefSeq is mapped based on a nucleotide alignment of the full sequence to a covariance model. Predicted proteins encoded by the sequence are validated with nucleotide-to-protein alignments using BLAST. The system identifies 43 types of “alerts” that (unlike the previous BLAST-based system) provide deterministic and rigorous feedback to researchers who submit sequences with unexpected characteristics. VADR has been integrated into GenBank’s submission processing pipeline, allowing viral submissions that pass all tests to be accepted and annotated automatically, without the need for any human (GenBank indexer) intervention. Unlike the previous submission-checking system, VADR is freely available (https://github.com/nawrockie/vadr) for local installation and use. VADR has been used for Norovirus submissions since May 2018 and for Dengue virus submissions since January 2019. Since March 2020, VADR has also been used to check SARS-CoV-2 sequence submissions. Other viruses with high numbers of submissions will be added incrementally.

Conclusion: VADR improves the speed with which non-flu virus submissions to GenBank can be checked and improves the content and quality of the GenBank annotations. The availability and portability of the software allow researchers to run the GenBank checks prior to submitting their viral sequences, and thereby gain confidence that their submissions will be accepted immediately without the need to correspond with GenBank staff. Reciprocally, the adoption of VADR frees GenBank staff to spend more time on services other than checking routine viral sequence submissions.
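
The alignment-based annotation step described in the Results can be illustrated with a minimal sketch. This is not VADR's code: the function name `map_feature`, the gapped-string representation of the alignment, and the toy sequences are all assumptions made for illustration. It only shows the general idea of carrying a feature's coordinates from a reference over to a submitted sequence through a pairwise nucleotide alignment.

```python
def map_feature(ref_aln, seq_aln, ref_start, ref_end):
    """Map a reference feature onto an aligned sequence (hypothetical helper).

    ref_aln, seq_aln: equal-length aligned strings, '-' marks a gap.
    ref_start, ref_end: 1-based inclusive feature coordinates on the
    ungapped reference.
    Returns (start, end) in 1-based coordinates of the ungapped sequence,
    or None if no feature position aligns to a sequence residue.
    """
    if len(ref_aln) != len(seq_aln):
        raise ValueError("aligned strings must have the same length")

    ref_pos = 0   # residues of the reference consumed so far
    seq_pos = 0   # residues of the sequence consumed so far
    mapped = []   # sequence positions aligned to feature positions

    for r, s in zip(ref_aln, seq_aln):
        if r != "-":
            ref_pos += 1
        if s != "-":
            seq_pos += 1
        # Record only columns where a feature position in the reference
        # is aligned to an actual residue in the sequence.
        if r != "-" and s != "-" and ref_start <= ref_pos <= ref_end:
            mapped.append(seq_pos)

    if not mapped:
        return None
    return mapped[0], mapped[-1]


# Toy alignment: the sequence has one deletion and one insertion
# relative to the reference.
ref_aln = "ATGAAA-TAG"
seq_aln = "ATG-AACTAG"

whole = map_feature(ref_aln, seq_aln, 1, 9)
middle = map_feature(ref_aln, seq_aln, 4, 6)
```

In this sketch a feature whose mapped length differs from the reference, or is no longer a multiple of three for a coding region, is the kind of unexpected characteristic that a validation step could flag for the submitter.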
