Comparing the Speed and Accuracy of the Smith and Waterman Algorithm as Implemented by Mpsrch with the Blast and Fasta Heuristics for Sequence Similarity Searching

INTRODUCTION. Similarity searching is used to identify homologies between a query sequence and sequences in a database to elucidate the function of the former by considering the latter. Similarity searching (or more appropriately, dissimilarity searching) is also used in oligomer design, which involves the identification of a unique N-mer (N < 100) to represent a gene for microarray and other assays. The sensitivity of the search is a measure of how well an algorithm can locate all related or matching sequences in the database. The BLAST heuristic is probably the most widely used sequence matching method today, due primarily to its speed and its availability on public servers with graphical interfaces (such as the one at NCBI)[1,5]. Many commercial versions are available that are accelerated in some manner. The FASTA heuristic is also used; it is more sensitive than BLAST but correspondingly slower[2,4]. Both of these methods are based on approximations that aggregate the sequence into tokens prior to the search to reduce the computational complexity (i.e., decrease the time to search). The Smith-Waterman algorithm is an exhaustive search based on Bellman's dynamic programming algorithm and is therefore the most sensitive (and historically the slowest) of the three methods[3]. In fact, once the approximate methods of BLAST and FASTA have produced sites of potential alignment, it is often the Smith-Waterman algorithm that is used to calculate the actual alignment. MPSRCH is an implementation of the Smith-Waterman algorithm that exploits the capabilities of the processor hardware to increase the speed of the algorithm to a level similar to that of BLAST or FASTA. It has been implemented on the Compaq Alpha, the Intel Pentium, and the Motorola PowerPC.

METHOD. The sensitivity of these three algorithms is evaluated systematically against a database of proteins or nucleic acids.
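As an aside, the exhaustive dynamic-programming search mentioned in the introduction can be sketched in a few lines of Python. This is a minimal illustration only, not the MPSRCH implementation: the function name smith_waterman_score, the scoring values (match, mismatch, gap), and the use of a linear gap penalty are all assumptions chosen for brevity, and the sketch returns only the best local score rather than the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Scoring values are illustrative assumptions, not those used by MPSRCH,
    BLAST, or FASTA. A linear gap penalty is assumed.
    """
    # Only the previous row of the score matrix is needed for the score.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: a local alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell of the len(a) by len(b) matrix is visited, which is why the method is exhaustive and why its running time grows with the product of the two sequence lengths. Under these assumed scores, a sequence compared with itself scores 2 per residue, so a self-query should always rank at the top of an exhaustive search.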
Each algorithm is tested by selecting one of the genes in the database as the query sequence, first using the default settings of the algorithm and then varying the settings to improve the performance.

RESULTS. The BLAST algorithm is the least sensitive and occasionally fails to find the query sequence known to be in the database. The FASTA algorithm will also occasionally fail to find matches produced by MPSRCH. Such a failure can be