Mutant-Bin: Unsupervised Haplotype Estimation of Viral Population Diversity Without Reference Genome

High genetic variability in viral populations plays an important role in disease progression, pathogenesis, and drug resistance. The last few years have seen significant progress in the development of methods for reconstructing viral populations from next-generation sequencing data. These methods identify the differences between individual haplotypes by mapping the short reads to a reference genome. Much less has been published about resolving the population structure when a reference genome is lacking or is not well-defined, which severely limits the application of these new technologies to resolving viral population structure. We describe a computational framework, called Mutant-Bin, for clustering individual haplotypes in a viral population and determining their prevalence based on a set of deep sequencing reads. The main advantages of our method are that: (i) it enables determination of the population structure and haplotype frequencies when a reference genome is lacking; (ii) the method is unsupervised, in that the number of haplotypes does not have to be specified in advance; and (iii) it identifies the polymorphic sites that co-occur in a subset of haplotypes and the frequency with which they appear in the viral population. The method was evaluated on simulated reads with sequencing errors and on 454 pyrosequencing reads from HIV samples. Our method clustered a high percentage of haplotypes with low false-positive rates, even at low genetic diversity.
