pFANGS: Parallel high speed sequence mapping for Next Generation 454-roche Sequencing reads

Next Generation Sequencing machines generate millions of DNA sequences (reads) every day. High-performance algorithms are needed to map these sequences to a reference genome in order to identify single nucleotide polymorphisms or rare transcripts, a prerequisite for personalized medicine. In this paper, we present pFANGS, a high-throughput parallel sequence mapping program. pFANGS is designed to find all matches of a query sequence in the reference genome while tolerating a large number of mismatches or insertions/deletions. It partitions the computational workload and the data among all processes and employs load-balancing mechanisms to improve process efficiency. Our experiments show that, with 512 processors, we can map approximately 31 million 454/Roche queries of length 500 each to a reference human genome per hour, allowing 5 mismatches or insertions/deletions at full sensitivity. We also report and compare the performance of two alternative parallel implementations of pFANGS: a shared-memory OpenMP implementation and an MPI-OpenMP hybrid implementation.
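The abstract does not say how pFANGS divides its workload or balances load, so the following is only a minimal sketch of two generic techniques that fit the description: a static block partition of queries across processes, and a greedy "largest cost first" assignment of work items to the least-loaded process. The function names `block_partition` and `greedy_balance`, and the per-item cost estimates, are hypothetical and not taken from pFANGS.

```python
import heapq


def block_partition(n_items, n_procs):
    """Split range(n_items) into n_procs contiguous, near-equal blocks.

    Returns a list of (start, end) half-open index ranges, one per process.
    Hypothetical helper; not the pFANGS partitioning scheme.
    """
    base, extra = divmod(n_items, n_procs)
    bounds = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` ranks each take one additional item.
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def greedy_balance(costs, n_procs):
    """Assign items to processes, largest estimated cost first.

    Each item goes to the process with the smallest accumulated load.
    Returns one list of item indices per process.
    Hypothetical helper; the costs are assumed estimates.
    """
    heap = [(0, rank) for rank in range(n_procs)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_procs)]
    for idx in sorted(range(len(costs)), key=lambda i: -costs[i]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + costs[idx], rank))
    return assignment


if __name__ == "__main__":
    # 10 queries over 3 processes.
    print(block_partition(10, 3))
    # Six work items with uneven estimated costs over 2 processes.
    print(greedy_balance([5, 3, 3, 2, 2, 1], 2))
```

A static block partition is cheap but assumes every query costs about the same to map. The greedy assignment is one simple way to even out load when costs differ, for example when some queries hit repetitive regions of the reference.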
