V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput sequencing data

High-throughput sequencing technologies are increasingly used not only in viral genomics research but also in clinical surveillance and diagnostics. These technologies facilitate the assessment of genetic diversity in intra-host virus populations, which affects the transmission, virulence, and pathogenesis of viral infections. However, analysing viral diversity poses two major challenges. First, amplification and sequencing errors confound the identification of true biological variants; second, the large data volumes impose computational limitations. To support viral high-throughput sequencing studies, we developed V-pipe, a bioinformatics pipeline that combines state-of-the-art statistical models and computational tools for automated end-to-end analysis of raw sequencing reads. V-pipe supports quality control, read mapping and alignment, low-frequency mutation calling, and inference of viral haplotypes. To generate high-quality read alignments, we developed a novel method, ngshmmalign, which is based on profile hidden Markov models and tailored to small and highly diverse viral genomes. V-pipe also includes benchmarking functionality that provides a standardized environment for comparative evaluation of different pipeline configurations. We demonstrate this capability by assessing the impact of three read aligners (Bowtie 2, BWA MEM, ngshmmalign) and two variant callers (LoFreq, ShoRAH) on the performance of calling single-nucleotide variants in intra-host virus populations. V-pipe supports various pipeline configurations and is implemented in a modular fashion to facilitate adaptation to the continuously changing technology landscape. V-pipe is freely available at https://github.com/cbg-ethz/V-pipe.
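As a rough illustration of the kind of comparison the benchmarking step enables, the sketch below scores single-nucleotide variant calls against a known ground truth for each aligner and caller combination. It is not V-pipe code: the function names (`snv_metrics`, `benchmark`), the representation of an SNV as a `(position, base)` pair, and the toy data are all assumptions made for this example.

```python
from itertools import product

# The three aligners and two variant callers named in the abstract.
ALIGNERS = ["bowtie2", "bwa_mem", "ngshmmalign"]
CALLERS = ["lofreq", "shorah"]


def snv_metrics(called, truth):
    """Return (precision, recall) for two collections of (position, base) SNVs.

    Hypothetical helper: an empty call set or truth set yields 0.0 for the
    affected metric, which is a convention chosen for this sketch.
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def benchmark(calls_by_config, truth):
    """Score every (aligner, caller) configuration against the same truth set."""
    return {
        config: snv_metrics(calls, truth)
        for config, calls in calls_by_config.items()
    }


# Toy ground truth and toy calls (invented values, for illustration only).
truth = {(100, "A"), (250, "G"), (731, "T")}
calls_by_config = {config: set() for config in product(ALIGNERS, CALLERS)}
calls_by_config[("ngshmmalign", "shorah")] = {(100, "A"), (250, "G"), (400, "C")}

results = benchmark(calls_by_config, truth)
for (aligner, caller), (precision, recall) in sorted(results.items()):
    print(f"{aligner:12s} {caller:8s} precision={precision:.2f} recall={recall:.2f}")
```

Scoring every configuration against one shared truth set keeps the comparison controlled: only the aligner and caller vary between rows. A real evaluation would likely also account for variant frequency and sequencing depth.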
