Probabilistic model based error correction in a set of various mutant sequences analyzed by next-generation sequencing

To analyze the evolutionary dynamics of a mutant population in an evolutionary experiment, a vast number of mutants must be sequenced with high-throughput (next-generation) sequencing technologies, which enable rapid, parallel analysis of multikilobase sequences. However, the observed sequences contain many base-calling errors. When next-generation sequencing is applied to a heterogeneous population of mutant sequences, it is therefore necessary to discriminate between true point mutations and base-calling errors in the observed sequences, and to correct the errors. To address this issue, we have developed a novel error-correction method based on the Potts model and a maximum a posteriori probability (MAP) estimate of its parameters, which correspond to the "true sequences". The method uses (1) the "quality scores" assigned to individual bases in the observed sequences and (2) the neighborhood relationships among the observed sequences mapped in sequence space. Computer experiments on artificially generated sequences supported the effectiveness of the method, showing that 50-90% of errors were removed. Interestingly, the method is analogous to a probabilistic-model-based method of image restoration developed in the field of information engineering.
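The way the two information sources interact can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes Phred-scaled quality scores (error probability 10^(-Q/10)), aligned reads of equal length, a neighborhood defined by a Hamming-distance cutoff, and a greedy coordinate-wise update (iterated conditional modes) in place of whatever MAP estimation procedure the paper uses. The names `correct_reads`, `coupling`, and `radius` are hypothetical.

```python
import math
from typing import List

BASES = "ACGT"

def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))

def log_likelihood(observed: str, q: int, candidate: str) -> float:
    """Log-probability of the observed base call given a candidate true base.
    Assumption: an erroneous call is equally likely to be any of the other 3 bases."""
    p = min(max(phred_to_error_prob(q), 1e-12), 1.0 - 1e-12)  # avoid log(0)
    return math.log(1.0 - p) if candidate == observed else math.log(p / 3.0)

def correct_reads(reads: List[str], quals: List[List[int]],
                  coupling: float = 1.0, radius: int = 2,
                  sweeps: int = 5) -> List[str]:
    """Greedy Potts-like correction (illustrative, hypothetical parameters).
    Each base is set to the value maximizing: quality-based log-likelihood
    + coupling * (number of neighboring reads that agree at that position)."""
    n = len(reads)
    current = [list(r) for r in reads]
    # Neighborhood in sequence space: reads within `radius` mismatches.
    neighbors = [[k for k in range(n)
                  if k != i and hamming(reads[i], reads[k]) <= radius]
                 for i in range(n)]
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            for j in range(len(reads[i])):
                best_base, best_score = current[i][j], -math.inf
                for b in BASES:
                    score = log_likelihood(reads[i][j], quals[i][j], b)
                    score += coupling * sum(current[k][j] == b
                                            for k in neighbors[i])
                    if score > best_score:
                        best_base, best_score = b, score
                if best_base != current[i][j]:
                    current[i][j] = best_base
                    changed = True
        if not changed:  # converged: no base changed in a full sweep
            break
    return ["".join(r) for r in current]

# A low-quality mismatch is pulled toward its neighbors ...
reads = ["ACGT", "ACGT", "ACGT", "ACTT"]
low_q = [[40] * 4, [40] * 4, [40] * 4, [40, 40, 5, 40]]
corrected_low = correct_reads(reads, low_q)
# ... while the same mismatch with a high quality score is kept as a mutation.
high_q = [[40] * 4 for _ in reads]
corrected_high = correct_reads(reads, high_q)
```

In this toy setting the coupling term plays the role of the Potts interaction between neighboring sequences, and the quality-derived likelihood plays the role of the data term; the balance between the two decides whether a mismatch is treated as an error or as a true point mutation.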
