Highly scalable and accurate seeds for subsequence alignment

We propose a method for finding seeds for the local alignment of two nucleotide sequences. Our method uses randomized algorithms to find approximate seeds. We present a dynamic index to store the fingerprints of k-grams and a highly scalable and accurate (HSA) algorithm to incorporate randomization into process of seed generation. Experimental results show that our method produces better quality seeds with improved running time and memory usage compared to traditional non-spaced and spaced seeds. The presented algorithm scales very well with higher seed lengths while maintaining the quality and performance.

[1]  Louxin Zhang,et al.  Good spaced seeds for homology search , 2004, Bioinform..

[2]  Burkhard Morgenstern,et al.  DIALIGN2: Improvement of the segment to segment approach to multiple sequence alignment , 1999, German Conference on Bioinformatics.

[3]  Bin Ma,et al.  PatternHunter: faster and more sensitive homology search , 2002, Bioinform..

[4]  Lukas Wagner,et al.  A Greedy Algorithm for Aligning DNA Sequences , 2000, J. Comput. Biol..

[5]  R. Gibbs,et al.  PipMaker--a web server for aligning two genomic DNA sequences. , 2000, Genome research.

[6]  Inna Dubchak,et al.  Glocal alignment: finding rearrangements during alignment , 2003, ISMB.

[7]  W. J. Kent,et al.  BLAT--the BLAST-like alignment tool. , 2002, Genome research.

[8]  Michael Brudno,et al.  Fast and sensitive alignment of large genomic sequences , 2002, Proceedings. IEEE Computer Society Bioinformatics Conference.

[9]  Thomas L. Madden,et al.  BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. , 1999, FEMS microbiology letters.

[10]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[11]  B. Berger,et al.  ARACHNE: a whole-genome shotgun assembler. , 2002, Genome research.

[12]  Pankaj Agarwal,et al.  Compact Encoding Strategies for DNA Sequence Similarity Search , 1996, ISMB.

[13]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[14]  Burkhard Morgenstern,et al.  DIALIGN: finding local similarities by multiple sequence alignment , 1998, Bioinform..

[15]  Isidore Rigoutsos,et al.  FLASH: a fast look-up algorithm for string homology , 1993, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Martin Vingron,et al.  q-gram based database searching using a suffix array (QUASAR) , 1999, RECOMB.

[17]  Chuong B. Do,et al.  Access the most recent version at doi: 10.1101/gr.926603 References , 2003 .

[18]  S. Salzberg,et al.  Alignment of whole genomes. , 1999, Nucleic acids research.

[19]  Rajeev Motwani,et al.  Randomized Algorithms , 1995, SIGA.

[20]  Manindra Agrawal,et al.  PRIMES is in P , 2004 .

[21]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[22]  Stefan Kurtz,et al.  REPuter: fast computation of maximal repeats in complete genomes , 1999, Bioinform..

[23]  S. B. Needleman,et al.  A general method applicable to the search for similarities in the amino acid sequence of two proteins. , 1970, Journal of molecular biology.

[24]  Nicholas L. Bray,et al.  AVID: A global alignment program. , 2003, Genome research.