Fast Dynamic Programming Based Sequence Alignment Algorithm

Protein sequence alignment is a basic operation in protein sequence analysis. Algorithms that guarantee an optimal alignment are based on dynamic programming, and the Smith-Waterman algorithm is the most commonly used of them. However, it requires quadratic time and space. Heuristic algorithms such as FASTA and BLAST were introduced to speed up sequence alignment. FASTA is based on word search, whereas BLAST is based on maximal segment pairs. In a word search algorithm, lists of words from the query and database sequences are compared to determine whether two sequences have a region of sufficient similarity to merit further alignment with the Smith-Waterman algorithm. All of these algorithms use substitution matrices based on the twenty-letter amino acid alphabet. However, research shows that reducing the alphabet to 10 letters does not affect the similarity measure. Our proposed algorithm uses a reduced amino acid alphabet to transform each protein sequence into a sequence of integers and uses n-grams to reduce the length of that sequence. The Smith-Waterman algorithm is then used to compute the similarity measure between two sequences. Results show that the proposed algorithm is as sensitive as the Smith-Waterman algorithm, yet uses less space and performs better.
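The pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the 10-group amino acid partition in `GROUPS`, the use of non-overlapping n-grams (with any incomplete trailing block dropped), and the simple match/mismatch/linear-gap scoring in place of a substitution matrix. The function names are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical partition of the 20 amino acids into 10 groups.
# Illustrative only; the paper's actual grouping is not given in the abstract.
GROUPS = ["AG", "C", "DE", "FWY", "H", "ILMV", "KR", "NQ", "P", "ST"]
GROUP_OF: Dict[str, int] = {
    aa: g for g, members in enumerate(GROUPS) for aa in members
}


def reduce_alphabet(seq: str) -> List[int]:
    """Map each residue to the integer index of its group."""
    return [GROUP_OF[aa] for aa in seq.upper()]


def ngram_encode(codes: Sequence[int], n: int = 2) -> List[int]:
    """Pack each non-overlapping block of n codes into one integer.

    Assumption: blocks do not overlap, so the output is about n times
    shorter than the input. An incomplete trailing block is dropped.
    """
    base = len(GROUPS)
    out = []
    for i in range(0, len(codes) - n + 1, n):
        value = 0
        for c in codes[i:i + n]:
            value = value * base + c
        out.append(value)
    return out


def smith_waterman_score(a: Sequence[int], b: Sequence[int],
                         match: int = 2, mismatch: int = -1,
                         gap: int = -2) -> int:
    """Best local alignment score with a linear gap penalty.

    Only two rows of the score matrix are kept, which is enough when
    the score alone (not the alignment itself) is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def similarity(seq_a: str, seq_b: str, n: int = 2) -> int:
    """Reduce, n-gram encode, then score with Smith-Waterman."""
    enc_a = ngram_encode(reduce_alphabet(seq_a), n)
    enc_b = ngram_encode(reduce_alphabet(seq_b), n)
    return smith_waterman_score(enc_a, enc_b)
```

Under these assumptions, `similarity("ILKR", "VMRK")` treats the two sequences as identical, because every residue falls in the same group as its counterpart. With n = 2 the dynamic programming table shrinks by roughly a factor of four, since both sequences are halved in length.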
