Whole Genome Comparison using Commodity Workstations

Whole genome comparison consists of comparing or aligning two genome sequences in the hope that analogous functional or physical characteristics may be observed. Sequence comparison is performed either with slow but rigorous algorithms or with faster heuristic approaches. However, because genomic sequences are so large, the capacity of current software is limited. In this work, we design a parallel, distributed system for the Smith-Waterman dynamic programming sequence comparison algorithm. We use subword parallelism to speed up sequence-to-sequence comparison using Streaming SIMD Extensions (SSE) on Intel Pentium processors. We compare two approaches, one requiring explicit data dependency handling and the other built to handle dependencies automatically. We achieve a speedup of 10 to 30 times and establish the optimum conditions for each approach. We then implement a scalable and fault-tolerant distributed version of the genome comparison process on a network of workstations, based on a static work allocation algorithm. We achieve speeds upwards of 8000 MCUPS (million cell updates per second) on 64 workstations, making this one of the fastest implementations of the Smith-Waterman algorithm.
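For readers unfamiliar with the recurrence being parallelized, the following is a minimal scalar sketch of Smith-Waterman local alignment scoring, together with the arithmetic behind the MCUPS metric. It is not the SSE or distributed implementation described above. The function names and the scoring parameters (match +2, mismatch -1, linear gap penalty -1) are illustrative assumptions, not values taken from this work.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Assumed scoring scheme (illustrative only): linear gap penalty,
    fixed match/mismatch scores. Only two rows of the dynamic
    programming matrix are kept, so memory is O(len(b)).
    """
    best = 0
    prev = [0] * (len(b) + 1)  # row i-1 of the matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)  # row i; column 0 stays 0
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Each cell depends on its upper-left, upper, and left
            # neighbours; this is the data dependency that a SIMD
            # version has to respect.
            curr[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            if curr[j] > best:
                best = curr[j]
        prev = curr
    return best


def mcups(cell_updates, seconds):
    """Million cell updates per second for a run of the given size."""
    return cell_updates / seconds / 1e6
```

Comparing sequences of lengths m and n updates m × n cells, so a run that fills 8 × 10^9 cells in one second corresponds to 8000 MCUPS.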
