Handling biological sequence alignments on networked computing systems: A divide-and-conquer approach

In this paper, we address the biological sequence alignment problem, one of the most commonly used steps in bioinformatics applications. We employ the Divisible Load Theory (DLT) paradigm, which is suited to large-scale processing on network-based systems, to achieve a high degree of parallelism. Using the DLT paradigm, we propose a strategy that partitions the computational workload among the processors in the system so as to minimize the overall time needed to determine the maximum similarity between DNA/protein sequences. We consider handling this computational problem on networked computing platforms connected as a linear daisy chain. We derive the individual load quantum to be assigned to each processor according to the computation and communication link speeds along the chain. We consider two cases of sequence alignment, in which the post-processing step, i.e., the trace-back process required to determine an optimal alignment, may or may not be done at the individual processors in the system. We derive critical conditions that determine whether our strategies yield an optimal processing time. When these conditions cannot be satisfied, we apply three different heuristic strategies proposed in the literature to generate sub-optimal solutions for the processing time. To test the proposed schemes, we use real-life DNA samples of the house mouse mitochondrion and the human mitochondrion, obtained from the public database GenBank [GenBank, http://www.ncbi.nlm.nih.gov], in our simulation experiments. Through this study, we demonstrate the applicability and potential of the DLT paradigm for such biological sequence related computational problems.
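To make "maximum similarity" and the role of the trace-back step concrete, the following is a minimal sketch of a local alignment score computation. It assumes a Smith-Waterman style recurrence with a linear gap penalty; the abstract does not state which recurrence or scoring scheme the paper uses, so the function name `local_similarity` and the `match`, `mismatch`, and `gap` values are illustrative assumptions only, and the DLT load partitioning itself is not shown.

```python
def local_similarity(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Assumed scoring scheme (not taken from the paper): +2 for a match,
    -1 for a mismatch, -1 per gap position.
    """
    # Only the previous row is kept, so memory is O(len(b)).
    # Recovering the alignment itself (trace-back) would require
    # retaining the full score matrix or recomputing parts of it.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_similarity("ACGT", "ACGT"))
```

Each cell depends only on its left, upper, and upper-left neighbours, which is why the score matrix can be divided among processors. The score-only variant above corresponds to the case where no trace-back is performed; the trace-back case needs the stored matrix entries as well.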
