A parallel strategy for biological sequence alignment in restricted memory space

Many organisms have recently had their DNA entirely sequenced, which creates the need to align long DNA sequences, a challenging task because of its high demands on computational power and memory. The Smith-Waterman (SW) algorithm is an exact method that obtains optimal local alignments in quadratic time and space. For long sequences, this quadratic complexity makes the algorithm impractical, and parallel computing becomes a very attractive alternative. In this paper, we propose and evaluate z-align, a parallel exact strategy based on the divergence concept that locally aligns long biological sequences using an affine gap function. Z-align runs in limited memory space, and the amount of memory used can be defined by the user. Results collected on a cluster with 16 processors show very good speedups for long real DNA sequences. With z-align, we were able to compare DNA sequences of up to 3 MBP (mega base-pairs). As far as we know, this is the first time that 3 MBP sequences have been compared with an exact, affine-gap variation of the SW algorithm. In addition, a comparison of the results obtained with z-align and with the popular BLAST tool shows that z-align is able to produce longer and more significant alignments.
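To make the memory issue concrete, the following is a minimal sketch of local alignment scoring with an affine gap function, written in Python. It is not z-align and does not use the divergence concept; it only illustrates that the best SW score can be computed while keeping a single matrix row in memory. The function name `sw_affine_score`, the scoring values, and the gap convention (a gap of length k costs `gap_open + (k - 1) * gap_extend`) are assumptions chosen for illustration.

```python
def sw_affine_score(a, b, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Best local alignment score with affine gaps (illustrative sketch).

    Only the previous row is kept, so memory grows with len(b) rather than
    len(a) * len(b). Running time is still quadratic.
    """
    neg_inf = float("-inf")
    n = len(b)
    h_prev = [0] * (n + 1)        # best scores ending at each cell of the previous row
    f_prev = [neg_inf] * (n + 1)  # previous-row scores for alignments ending in a vertical gap
    best = 0

    for ch_a in a:
        h_curr = [0] * (n + 1)
        f_curr = [neg_inf] * (n + 1)
        e = neg_inf  # score for alignments ending in a horizontal gap in this row
        for j in range(1, n + 1):
            # Either extend an existing gap or open a new one.
            e = max(e - gap_extend, h_curr[j - 1] - gap_open)
            f_curr[j] = max(f_prev[j] - gap_extend, h_prev[j] - gap_open)
            diag = h_prev[j - 1] + (match if ch_a == b[j - 1] else mismatch)
            # The 0 lets a local alignment restart anywhere.
            h_curr[j] = max(0, diag, e, f_curr[j])
            best = max(best, h_curr[j])
        h_prev, f_prev = h_curr, f_curr

    return best


if __name__ == "__main__":
    print(sw_affine_score("ACGT", "ACGT"))  # four matches at +1 each: prints 4
```

This sketch returns only the score. Recovering the alignment itself without storing the full matrix requires additional machinery, which is the problem that limited-memory strategies such as the one described above address.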
