VIP: an integrated pipeline for metagenomics of virus identification and discovery

Identification and discovery of viruses using next-generation sequencing (NGS) technology is a fast-developing area with potentially wide application in clinical diagnostics, public health monitoring and novel virus discovery. However, the enormous volume of sequence data produced by NGS studies poses great challenges to both the accuracy and the speed of analysis. Here we describe VIP (“Virus Identification Pipeline”), a one-touch computational pipeline for virus identification and discovery from metagenomic NGS data. VIP performs the following steps: (i) mapping and filtering out of background-related reads, (ii) extensive classification of reads on the basis of nucleotide and remote amino acid homology, and (iii) multiple k-mer based de novo assembly and phylogenetic analysis to provide evolutionary insight. We validated the feasibility and veracity of this pipeline with sequencing results from various types of clinical samples and with public datasets. VIP has also contributed to timely virus diagnosis (~10 min) in acutely ill patients, demonstrating its potential for unbiased NGS-based clinical studies that demand a short turnaround time. VIP is released under GPLv3 and is available for free download at https://github.com/keylabivdc/VIP.
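To make the first two steps concrete, the following is a minimal toy sketch of background subtraction followed by nucleotide-level read classification. It is not VIP's implementation: it stands in exact k-mer sharing for the alignment tools a real pipeline would use, and the sequences, the function names (`kmers`, `filter_background`, `classify`) and the thresholds (`k=8`, `max_shared`, `min_shared`) are all made-up assumptions for illustration only.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_background(reads, background, k=8, max_shared=0.5):
    """Step (i), toy version: drop reads whose k-mers mostly occur in the background."""
    bg = kmers(background, k)
    kept = []
    for read in reads:
        rk = kmers(read, k)
        # Reads shorter than k have no k-mers; treat them as background.
        shared = len(rk & bg) / len(rk) if rk else 1.0
        if shared <= max_shared:
            kept.append(read)
    return kept


def classify(reads, references, k=8, min_shared=0.5):
    """Step (ii), toy version: assign each read to the reference sharing the most k-mers."""
    ref_kmers = {name: kmers(seq, k) for name, seq in references.items()}
    calls = Counter()
    for read in reads:
        rk = kmers(read, k)
        best_name, best_frac = "unclassified", 0.0
        for name, ks in ref_kmers.items():
            frac = len(rk & ks) / len(rk) if rk else 0.0
            if frac > best_frac:
                best_name, best_frac = name, frac
        label = best_name if best_frac >= min_shared else "unclassified"
        calls[label] += 1
    return calls


# Hypothetical toy sequences (not real genomes), chosen only for the demo.
HOST = "ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCC"
REFERENCES = {
    "virus_a": "AGTTGTTAGTCTACGTGGACCGACAAAGACAGATTCTTTGAGGGAGC",
    "virus_b": "CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGA",
}
READS = [
    HOST[5:30],                         # background read, should be removed
    REFERENCES["virus_a"][0:25],
    REFERENCES["virus_a"][15:40],
    REFERENCES["virus_b"][10:35],
    "TTTTCCCCGGGGAAAATTTTCCCCG",        # matches nothing in the toy database
]

non_background = filter_background(READS, HOST)
calls = classify(non_background, REFERENCES)
print(dict(calls))
```

In this sketch the read that matches no reference is reported as `unclassified`; in a pipeline of the kind described above, such reads would be the candidates for the more sensitive amino acid homology search and for de novo assembly.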
