Probabilistic Inference of Viral Quasispecies Subject to Recombination

RNA viruses exist in their hosts as populations of different but related strains. The virus population, often called quasispecies, is shaped by a combination of genetic change and natural selection. Genetic change is due to both point mutations and recombination events. We present a jumping hidden Markov model that describes the generation of viral quasispecies and a method to infer its parameters from next-generation sequencing data. The model introduces position-specific probability tables over the sequence alphabet to explain the diversity that can be found in the population at each site. Recombination events are indicated by a change of state, allowing a single observed read to originate from multiple sequences. We present a specific implementation of the expectation maximization (EM) algorithm to find maximum a posteriori estimates of the model parameters and a method to estimate the distribution of viral strains in the quasispecies. The model is validated on simulated data, showing the advantage of explicitly taking the recombination process into account, and applied to reads obtained from a clinical HIV sample.

[1]  Shun-ichi Amari,et al.  Identifiability of hidden Markov information sources and their minimum degrees of freedom , 1992, IEEE Trans. Inf. Theory.

[2]  Robert Tibshirani,et al.  Estimating the number of clusters in a data set via the gap statistic , 2000 .

[3]  Lior Pachter,et al.  Viral Population Estimation Using Pyrosequencing , 2007, PLoS Comput. Biol..

[4]  Russell J. Davenport,et al.  Removing Noise From Pyrosequenced Amplicons , 2011, BMC Bioinformatics.

[5]  Ron Shamir,et al.  GERBIL: Genotype resolution and block identification using likelihood. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[6]  Emese Meglécz,et al.  Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing , 2011, BMC Genomics.

[7]  Niko Beerenwinkel,et al.  Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies , 2010, Nucleic acids research.

[8]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[9]  M. Metzker Sequencing technologies — the next generation , 2010, Nature Reviews Genetics.

[10]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[11]  M. Vignuzzi,et al.  Quasispecies diversity determines pathogenesis through cooperative interactions in a viral population , 2006, Nature.

[12]  Paul Scheet,et al.  A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. , 2006, American journal of human genetics.

[13]  Niko Beerenwinkel,et al.  Ultra-deep sequencing for the analysis of viral populations. , 2011, Current opinion in virology.

[14]  Giovanni Ulivi,et al.  Combinatorial analysis and algorithms for quasispecies reconstruction using next-generation sequencing , 2011, BMC Bioinformatics.

[15]  M. Nowak,et al.  Viral quasi-species and recombination , 1996, Proceedings of the Royal Society of London. Series B: Biological Sciences.

[16]  M A Nowak,et al.  Antigenic diversity thresholds and the development of AIDS. , 1991, Science.

[17]  Sumio Watanabe,et al.  Algebraic geometry and stochastic complexity of hidden Markov models , 2005, Neurocomputing.

[18]  Volker Roth,et al.  Deep Sequencing of a Genetically Heterogeneous Sample: Local Haplotype Reconstruction and Read Error Correction , 2009, RECOMB.

[19]  Ming Zhang,et al.  A jumping profile Hidden Markov Model and applications to recombination sites in HIV and HCV genomes , 2006, BMC Bioinformatics.

[20]  Jens Stoye,et al.  A Novel Approach to Remote Homology Detection: Jumping Alignments , 2002, J. Comput. Biol..

[21]  Nicholas Eriksson,et al.  ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data , 2011, BMC Bioinformatics.

[22]  Matthew J. Beal Variational algorithms for approximate Bayesian inference , 2003 .

[23]  Michael Monsour,et al.  Minority HIV-1 Drug Resistance Mutations Are Present in Antiretroviral Treatment–Naïve Populations and Associate with Reduced Treatment Efficacy , 2008, PLoS medicine.