A frame-based representation of genomic sequences for removing errors and rare variant detection in NGS data

We propose a frame-based representation of k-mers for detecting sequencing errors and rare variants in next-generation sequencing (NGS) data obtained from populations of closely related genomes. Frames are sets of non-orthogonal basis functions, traditionally used in signal processing for noise removal. We define a frame for genomes and sequenced reads to consist of the discrete spatial signals of every k-mer of a given size. We show that each k-mer in the sequenced data can be projected onto multiple frames, and that these projections are maximized for the spatial signals corresponding to the k-mer's substrings. Our proposed classifier, MultiRes, is trained with the projections of k-mers as features and marks each k-mer as either erroneous or a true variation in the genome. We evaluate MultiRes on simulated and real viral population datasets and compare it to other error correction methods from the literature. MultiRes produces 4 to 500 times fewer false positive k-mer predictions than other methods, which is essential for accurate estimation of viral population diversity and for de novo assembly of viral populations. Its recall of true k-mers is high and comparable to that of other error correction methods. MultiRes also achieves greater than 95% recall for single nucleotide polymorphisms (SNPs) and reports fewer false positive SNPs, while detecting a higher number of rare variants than other variant calling methods for viral populations. The software is freely available on GitHub.
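
The idea of describing a k-mer by how strongly its substrings are represented in the reads can be illustrated with a minimal sketch. This is not the MultiRes implementation: the function names `count_kmers` and `substring_features`, the choice of substring sizes, and the use of the minimum substring count as the per-size feature are all assumptions made for illustration, and plain occurrence counts stand in for the frame projections described above.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def substring_features(kmer, counts_by_size):
    """Build one feature per substring size (hypothetical feature design).

    counts_by_size maps a substring size to the Counter produced by
    count_kmers for that size. For each size, the feature is the smallest
    count among the k-mer's substrings of that size (an illustrative
    choice, not taken from the paper).
    """
    features = []
    for size in sorted(counts_by_size):
        table = counts_by_size[size]
        sub_counts = [
            table[kmer[i:i + size]]
            for i in range(len(kmer) - size + 1)
        ]
        features.append(min(sub_counts) if sub_counts else 0)
    return features


def build_tables(reads, sizes):
    """Count k-mers at several resolutions (sizes are an assumption)."""
    return {size: count_kmers(reads, size) for size in sizes}
```

A feature vector of this kind could then be passed to any standard classifier; a k-mer whose substrings are all well supported gets uniformly high values, while a k-mer containing a rare or erroneous base gets low values at every size that covers that base.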
