Evaluation of haplotype callers for next-generation sequencing of viruses

Currently, the standard practice for assembling next-generation sequencing (NGS) reads of viral genomes is to summarize thousands of individual short reads into a single consensus sequence, thus obscuring intra-host diversity information that is useful for molecular phylodynamic inference. It is hypothesized that a few viral strains may dominate the intra-host genetic diversity, with a variety of lower-frequency strains comprising the rest of the population. Several software tools currently exist to convert NGS sequence variants into haplotypes. However, previous studies suggest that current approaches to haplotype reconstruction greatly underestimate intra-host diversity. Here, we tested twelve NGS haplotype reconstruction methods using viral populations simulated under realistic evolutionary dynamics. Parameters for the simulated data spanned the diversity estimates of known fast-evolving viruses (e.g., HIV-1) to test the limits of the haplotype reconstruction methods and to ensure coverage of predicted intra-host viral diversity levels. Using those parameters, we simulated HIV-1 viral populations of 216–1,185 haplotypes per host at a frequency <7%. All twelve investigated haplotype callers showed variable performance and produced drastically different results that were mainly driven by differences in mutation rate and, to a lesser extent, in effective population size. Most methods were able to accurately reconstruct haplotypes when genetic diversity was low. However, under higher levels of diversity (e.g., those seen in intra-host HIV-1 infections), haplotype reconstruction accuracy was highly variable and, on average, poor. At high diversity levels, most tools severely underestimated the true number of haplotypes, while a few greatly overestimated it. PredictHaplo and PEHaplo produced estimates close to the true number of haplotypes, although their haplotype reconstruction accuracy was worse than that of the other ten tools. We conclude that haplotype reconstruction from NGS short reads is unreliable due to the high genetic diversity of fast-evolving viruses. Local haplotype reconstruction of longer reads to phase variants may provide a more reliable estimation of viral variants within a population.

Highlights
Haplotype callers for NGS data vary greatly in their performance.
Haplotype caller performance is mainly determined by mutation rate.
Haplotype caller performance is less sensitive to effective population size.
Most haplotype callers perform well with low diversity and poorly with high diversity.
PredictHaplo performs best if genetic diversity is in the range of HIV diversity.
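The abstract separates two questions: whether a caller recovers roughly the right number of haplotypes, and whether the haplotypes it reports are correct. The sketch below illustrates that distinction on toy data. It is not the evaluation procedure used in the study; the function names, the Hamming-distance matching rule, and the `max_mismatches` tolerance are all assumptions made for illustration, and the sequences are assumed to be aligned to a common length.

```python
from typing import Dict, Iterable


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def evaluate(true_haps: Iterable[str],
             called_haps: Iterable[str],
             max_mismatches: int = 0) -> Dict[str, float]:
    """Compare called haplotypes with the simulated truth (illustrative only).

    recall      : fraction of true haplotypes matched by some called haplotype
    precision   : fraction of called haplotypes matching some true haplotype
    count_ratio : called count / true count (<1 under-, >1 overestimation)
    """
    true_set = set(true_haps)
    called_set = set(called_haps)
    if not true_set:
        raise ValueError("at least one true haplotype is required")

    def matched(hap: str, pool: set) -> bool:
        # A haplotype counts as matched if any sequence in the pool
        # lies within the allowed number of mismatches.
        return any(hamming(hap, other) <= max_mismatches for other in pool)

    recovered = sum(matched(t, called_set) for t in true_set)
    correct = sum(matched(c, true_set) for c in called_set)

    return {
        "recall": recovered / len(true_set),
        "precision": correct / len(called_set) if called_set else 0.0,
        "count_ratio": len(called_set) / len(true_set),
    }


if __name__ == "__main__":
    # Toy data: four true haplotypes, three called, one of them spurious.
    truth = ["ACGT", "ACGA", "TCGA", "ACCT"]
    calls = ["ACGT", "TCGA", "GGGG"]
    print(evaluate(truth, calls))
```

In this toy case the count ratio of 0.75 looks fairly close to the truth even though only half of the true haplotypes are recovered exactly, which mirrors the point that a good estimate of the number of haplotypes does not by itself imply accurate reconstruction.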
