Accurate assembly of minority viral haplotypes from next-generation sequencing through efficient noise reduction

Highly mutable RNA viruses such as influenza A virus, human immunodeficiency virus and hepatitis C virus exist in infected hosts as highly heterogeneous populations of closely related genomic variants. The presence of low-frequency variants that differ from the major strains by only a few mutations may result in immune escape, the emergence of drug resistance, and increased virulence and infectivity. Next-generation sequencing (NGS) technologies permit sequencing of intra-host viral populations at extremely great depth, thus providing an opportunity to access low-frequency variants. The long read lengths offered by single-molecule sequencing technologies allow all viral variants to be sequenced in a single pass. However, high sequencing error rates limit the ability to study heterogeneous viral populations composed of rare, closely related variants. In this article, we present CliqueSNV, a novel reference-based method for reconstruction of viral variants from NGS data. It efficiently constructs an allele graph based on linkage between single nucleotide variations (SNVs) and identifies true viral variants by merging cliques of that graph using combinatorial optimization techniques. The new method outperforms existing methods in both accuracy and running time on experimental and simulated NGS data for titrated levels of known viral variants. For PacBio reads, it accurately reconstructs variants with frequencies as low as 0.1%. For Illumina reads, it fully reconstructs the main variants. The open-source implementation of CliqueSNV is freely available for download at https://github.com/vyacheslav-tsivina/CliqueSNV
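To make the graph-based idea concrete, the following is a minimal illustrative sketch, not the CliqueSNV implementation. It assumes reads are already aligned to the reference and have equal length, it links two SNVs when they co-occur in at least `min_support` reads (a simple count threshold standing in for whatever linkage test the method actually uses), and it enumerates maximal cliques with a basic Bron-Kerbosch recursion. All function names, the threshold, and the toy data are hypothetical, and the clique-merging step is not shown.

```python
from collections import Counter
from itertools import combinations


def snvs(read, reference):
    """Return the (position, base) pairs where the read differs from the reference."""
    return frozenset((i, b) for i, (b, r) in enumerate(zip(read, reference)) if b != r)


def build_snv_graph(reads, reference, min_support=2):
    """Link two SNVs if they co-occur in at least min_support reads (assumed rule)."""
    pair_counts = Counter()
    for read in reads:
        for u, v in combinations(sorted(snvs(read, reference)), 2):
            pair_counts[(u, v)] += 1
    graph = {}
    for (u, v), n in pair_counts.items():
        if n >= min_support:
            graph.setdefault(u, set()).add(v)
            graph.setdefault(v, set()).add(u)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic (non-pivoting) Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)  # r cannot be extended, so it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(frozenset(), set(graph), set())
    return cliques


# Toy data: two minority variants plus one noisy read that mixes their SNVs.
reference = "AAAAAAAA"
reads = ["ACAAGAAA"] * 3 + ["AAATAATA"] * 2 + ["ACATAAAA"] + ["AAAAAAAA"] * 4
cliques = maximal_cliques(build_snv_graph(reads, reference, min_support=2))
```

In this toy example the single noisy read does not reach the support threshold, so its spurious SNV pair creates no edge, and each remaining clique corresponds to the SNV set of one candidate variant.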
