Recent advances in inferring viral diversity from high-throughput sequencing data.

Rapidly evolving RNA viruses exist within a host as a collection of closely related variants, referred to as a viral quasispecies. Advances in high-throughput sequencing (HTS) technologies have made it possible to assess the genetic diversity of such virus populations at an unprecedented level of detail. However, analysis of HTS data from virus populations is challenging because the reads are short and error-prone. To account for the uncertainty arising from these limitations, several computational and statistical methods have been developed for studying the genetic heterogeneity of virus populations. Here, we review methods for the analysis of HTS reads, including approaches to local diversity estimation and global haplotype reconstruction. We also discuss the challenges posed by read alignment and the impact of reference bias on diversity estimates. In addition, we address some of the experimental approaches designed to improve the biological signal-to-noise ratio. In the future, computational methods for the analysis of heterogeneous virus populations are likely to continue to be complemented by technological developments.
