FANGS: high speed sequence mapping for next generation sequencers

Next-generation sequencing machines generate millions of short DNA sequences (reads) every day. Efficient algorithms are needed to map these sequences to a reference genome in order to identify SNPs or rare transcripts and to fulfill the dream of personalized medicine. We present a Fast Algorithm for Next Generation Sequencers (FANGS), which dynamically reduces the search space by using q-gram filtering and the pigeonhole principle to rapidly map 454-Roche reads onto a reference genome. FANGS is a sequential algorithm designed to find all matches of a query sequence in the reference genome while tolerating a large number of mismatches or insertions/deletions. Using FANGS, we mapped 50,000 reads with a total of 25 million nucleotides to the human genome in as little as 23.3 minutes on a typical desktop computer. In our experiments, FANGS was up to an order of magnitude faster than state-of-the-art techniques for queries of length 500 allowing 5 mismatches or insertions/deletions.
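To illustrate the general idea behind q-gram filtering with the pigeonhole principle, the following is a minimal Python sketch and not the FANGS implementation. It relies on a standard observation: if a read aligns with at most k errors, then among any k+1 non-overlapping q-grams of the read at least one must occur exactly in the reference, so only the neighborhoods of exact q-gram hits need to be verified. All function names, the choice of q-gram positions, and the edit-distance verification step are assumptions made for this example.

```python
from collections import defaultdict


def build_qgram_index(reference, q):
    """Map every q-gram of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - q + 1):
        index[reference[i:i + q]].append(i)
    return index


def best_edit_distance(read, window):
    """Smallest edit distance between the read and any substring of window.

    Semi-global dynamic programming: the alignment may start and end
    anywhere in the window at no cost, but must cover the whole read.
    """
    prev = [0] * (len(window) + 1)  # free start anywhere in the window
    for i, rc in enumerate(read, 1):
        cur = [i]  # aligning i read characters against an empty prefix
        for j, wc in enumerate(window, 1):
            cur.append(min(prev[j - 1] + (rc != wc),  # match or mismatch
                           prev[j] + 1,               # extra read character
                           cur[j - 1] + 1))           # extra window character
        prev = cur
    return min(prev)  # free end anywhere in the window


def map_read(read, reference, index, q, k):
    """Return sorted (approximate_start, distance) pairs with distance <= k.

    Hypothetical helper: takes k+1 non-overlapping q-grams from the start
    of the read; by the pigeonhole principle at least one is error-free
    whenever the read matches with at most k errors.
    """
    if len(read) < (k + 1) * q:
        raise ValueError("read too short for k+1 non-overlapping q-grams")
    hits = set()
    for piece_no in range(k + 1):
        offset = piece_no * q
        piece = read[offset:offset + q]
        for pos in index.get(piece, ()):
            start = pos - offset
            # Indels can shift the true start by at most k positions.
            lo = max(0, start - k)
            hi = min(len(reference), start + len(read) + k)
            dist = best_edit_distance(read, reference[lo:hi])
            if dist <= k:
                hits.add((max(0, start), dist))
    return sorted(hits)


reference = "ACGTACCTGAGGATTCAGCTAGGCATCGTTAACGGATCCA"
read = "GGCTTCAGCTAG"  # reference[10:22] with one substitution
index = build_qgram_index(reference, q=4)
print(map_read(read, reference, index, q=4, k=1))
```

In this toy example the first q-gram of the read contains the substitution and has no exact hit, while the second one is intact and leads to the single candidate region that is then verified. The reported start is only approximate when insertions or deletions are present, since the same alignment can be reached from q-grams that are shifted by up to k positions.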
