Ehapp2: Estimate haplotype frequencies from pooled sequencing data with prior database information

To reduce the cost of large-scale re-sequencing, multiple individuals can be pooled and sequenced together, an approach known as pooled sequencing, which offers a cost-effective alternative to sequencing individuals separately. To apply pooled sequencing in haplotype-based disease association analysis, the critical step is to estimate haplotype frequencies accurately from pooled samples. Here we present Ehapp2, which estimates haplotype frequencies from pooled sequencing data by using a database that provides prior information on known haplotypes. We first translate the problem of estimating the frequency of each haplotype into finding a sparse solution to a system of linear equations, which is solved with the NNREG algorithm. Simulation experiments show that Ehapp2 is robust to sequencing errors: for pooled sequencing of a mixture of real Drosophila haplotypes at 50× total coverage, it estimates haplotype frequencies with less than 3% average relative difference even when the sequencing error rate is as high as 0.05. Because the proportions of local haplotypes spanning multiple SNPs are accurately calculated first, Ehapp2 retains excellent estimation accuracy for recombinant haplotypes resulting from chromosomal crossover. Comparisons with existing methods show that Ehapp2 is state-of-the-art for many sequencing study designs and better suited to current massively parallel sequencing.
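The linear-system formulation can be illustrated with a toy example. The sketch below is not the NNREG algorithm or the Ehapp2 implementation; it substitutes a plain projected-gradient nonnegative least-squares solver, and it works on per-SNP allele frequencies rather than the local multi-SNP haplotype proportions mentioned in the abstract. The function name `estimate_frequencies`, the haplotype panel, and the frequencies are all hypothetical, and the simulated pool is error-free.

```python
def estimate_frequencies(haplotypes, allele_freqs, iterations=2000):
    """Illustrative nonnegative least squares via projected gradient descent.

    haplotypes   : list of known haplotypes, each a list of 0/1 alleles per SNP
    allele_freqs : observed alternate-allele frequency at each SNP in the pool
    Returns one estimated frequency per known haplotype (summing to 1).
    """
    n_hap = len(haplotypes)
    n_snp = len(allele_freqs)

    # One equation per SNP: sum over haplotypes of allele * frequency = observed.
    rows = [[float(haplotypes[h][s]) for h in range(n_hap)] for s in range(n_snp)]
    target = [float(a) for a in allele_freqs]

    # Extra equation asking the frequencies to sum to 1.
    rows.append([1.0] * n_hap)
    target.append(1.0)

    # Safe step size: the squared Frobenius norm bounds the largest
    # eigenvalue of A^T A, so 1 / that bound cannot overshoot.
    step = 1.0 / sum(a * a for row in rows for a in row)

    freqs = [1.0 / n_hap] * n_hap  # start from a uniform guess
    for _ in range(iterations):
        residual = [sum(a * f for a, f in zip(row, freqs)) - t
                    for row, t in zip(rows, target)]
        gradient = [sum(rows[r][h] * residual[r] for r in range(len(rows)))
                    for h in range(n_hap)]
        # Gradient step, then project onto the nonnegative orthant.
        freqs = [max(0.0, f - step * g) for f, g in zip(freqs, gradient)]

    total = sum(freqs)
    return [f / total for f in freqs]


# Hypothetical panel of four known haplotypes over five SNPs.
panel = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
]
true_freqs = [0.5, 0.3, 0.2, 0.0]  # the last haplotype is absent from the pool

# Simulate error-free pooled allele frequencies from the true mixture.
observed = [sum(panel[h][s] * true_freqs[h] for h in range(len(panel)))
            for s in range(len(panel[0]))]

estimated = estimate_frequencies(panel, observed)
print([round(f, 4) for f in estimated])
```

In this noise-free toy case the system has a unique nonnegative solution, so the estimate converges to the simulated mixture and the absent haplotype receives a frequency of approximately zero. With real reads, sequencing error and finite coverage would perturb the observed frequencies, and the solution would only approximate the true mixture.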
