VIGA: a sensitive, precise and automatic de novo VIral Genome Annotator

Viral (meta)genomics is a rapidly growing field of study that is hampered by an inability to annotate the majority of viral sequences; new bioinformatic approaches are therefore needed. Here, we present VIGA, a new automatic de novo genome annotation pipeline for prokaryotic and eukaryotic viral sequences from (meta)genomic studies. VIGA was benchmarked on a database of known viral genomes and on a viral metagenomics case study. Compared with other programs, VIGA generated the most accurate outputs in terms of the number of coding sequences and their coordinates, and its outputs contained fewer non-informative annotations.
