Parallel Pair-HMM SNP Detection

I. MOTIVATION: Because each instrument run generates massive amounts of data, next-generation sequencing technologies present researchers with unique analytical challenges that require innovative, computationally efficient statistical solutions. Here we present a parallel implementation of a probabilistic Pair-Hidden Markov Model for base calling and SNP detection in next-generation sequencing data. Our approach incorporates multiple sources of error into the base calling procedure, which leads to more accurate results. In addition, it applies a likelihood ratio test that gives researchers straightforward SNP calling cutoffs based on either a p-value threshold or false discovery control.

II. RESULTS: We have developed GNUMAP-SNP, a highly accurate approach for identifying SNPs in next-generation sequencing data. By using a novel probabilistic Pair-Hidden Markov Model, GNUMAP-SNP accounts for uncertainty in both the read calls and the read mapping in an unbiased fashion. Our results show that GNUMAP-SNP has both high sensitivity and high specificity throughout the genome, particularly in repeat regions and in areas with low read coverage. In addition, we propose a statistical framework that accounts for background noise using straightforward statistical cutoffs that filter out false-positive results. The parallel implementation of SNP calling achieves near-linear speedup on both distributed-memory and shared-memory platforms.

III. AVAILABILITY: GNUMAP-SNP is available as a module in the GNUMAP probabilistic read mapping software. GNUMAP is freely available for download at: http://dna.cs.byu.edu/gnumap/.
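As a rough illustration of the likelihood ratio test and false discovery control described under MOTIVATION, the sketch below tests one genomic position at a time and then applies a Benjamini-Hochberg cutoff across positions. It is not the GNUMAP-SNP statistic itself. The null model (reference base with a fixed `error_rate`, the remainder split evenly among the other three bases), the generic multinomial test with 3 degrees of freedom, and the function names `lrt_pvalue` and `benjamini_hochberg` are all assumptions made for this example. The per-base counts may be fractional, as they would be if reads were weighted by alignment probability, in which case the chi-square p-value is only approximate.

```python
import math

BASES = "ACGT"

def lrt_pvalue(counts, ref_base, error_rate=0.01):
    """Likelihood ratio test p-value for one position (illustrative only).

    counts: dict mapping base -> (possibly fractional) read weight.
    H0 (assumed): reference base has probability 1 - error_rate,
                  each other base has probability error_rate / 3.
    H1: base probabilities are the observed proportions (the MLE).
    """
    total = sum(counts.get(b, 0.0) for b in BASES)
    if total <= 0:
        return 1.0  # no coverage, so no evidence against H0

    stat = 0.0
    for b in BASES:
        n = counts.get(b, 0.0)
        if n > 0:
            p0 = 1.0 - error_rate if b == ref_base else error_rate / 3.0
            stat += 2.0 * n * math.log((n / total) / p0)
    stat = max(stat, 0.0)  # guard against tiny negative rounding error

    # Closed-form survival function of the chi-square distribution, 3 d.o.f.
    return (math.erfc(math.sqrt(stat / 2.0))
            + math.sqrt(2.0 * stat / math.pi) * math.exp(-stat / 2.0))

def benjamini_hochberg(pvalues, fdr=0.05):
    """Return a list of booleans: True where the position passes the FDR cutoff."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= fdr * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes
    called = [False] * m
    for i in order[:cutoff_rank]:
        called[i] = True
    return called

# Hypothetical pileups: (per-base weights, reference base)
pileups = [
    ({"A": 20.0}, "A"),              # matches the reference
    ({"A": 10.0, "G": 10.0}, "A"),   # heterozygous-looking site
    ({"C": 0.4, "T": 14.6}, "C"),    # mostly non-reference
]
pvals = [lrt_pvalue(c, ref) for c, ref in pileups]
snp_calls = benjamini_hochberg(pvals, fdr=0.05)
```

A fixed p-value cutoff corresponds to comparing each entry of `pvals` against a threshold directly. The Benjamini-Hochberg step instead adapts the threshold to the number of positions tested.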
