A domain decomposition strategy for alignment of multiple biological sequences on multiprocessor platforms

Multiple Sequence Alignment (MSA) of biological sequences is a fundamental problem in computational biology because of its importance in a wide range of applications, including haplotype reconstruction, sequence homology, phylogenetic analysis, and prediction of evolutionary origins. The MSA problem is considered NP-hard, and known heuristics for it do not scale well as the number of sequences increases. Meanwhile, with the advent of a new breed of fast sequencing techniques, it is now possible to generate thousands of sequences very quickly. Rapid sequence analysis therefore calls for fast MSA algorithms that scale well with dataset size. In this paper, we present a novel domain decomposition based technique for solving the MSA problem on multiprocessing platforms. In addition to yielding better alignment quality, the technique gives large advantages in execution time and memory requirements. The proposed strategy allows one to decrease the time complexity of any known heuristic of O(N^x) complexity by a factor of O((1/p)^x), where N is the number of sequences, x depends on the underlying heuristic approach, and p is the number of processing nodes. In particular, we propose a highly scalable algorithm, Sample-Align-D, for aligning biological sequences using the MUSCLE system as the underlying heuristic. The proposed algorithm has been implemented on a cluster of workstations using the MPI library. Experimental results for different problem sizes are analyzed in terms of alignment quality, execution time, and speed-up.
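The complexity claim can be illustrated with a small arithmetic sketch. It assumes an idealized cost model in which the N sequences are split evenly across the p nodes and each node runs the underlying heuristic on its N/p sequences; the costs of sampling, communication, and merging the partial alignments are ignored. The function names are hypothetical and are not taken from the Sample-Align-D implementation.

```python
def serial_cost(n_sequences: int, x: float) -> float:
    """Idealized cost N^x of running the heuristic on all N sequences."""
    return float(n_sequences) ** x


def per_node_cost(n_sequences: int, p: int, x: float) -> float:
    """Idealized cost (N/p)^x on one node, assuming an even split (assumption)."""
    return (n_sequences / p) ** x


def reduction_factor(n_sequences: int, p: int, x: float) -> float:
    """Ratio of per-node cost to serial cost; equals (1/p)^x under this model."""
    return per_node_cost(n_sequences, p, x) / serial_cost(n_sequences, x)


if __name__ == "__main__":
    # Example: 8000 sequences, 4 nodes, a cubic heuristic (x = 3).
    print(reduction_factor(8000, 4, 3))
```

Under this model the ratio depends only on p and x, not on N, which is why the reduction is stated as a factor of (1/p)^x.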
