Sensitivity and selectivity in protein similarity searches: a comparison of Smith-Waterman in hardware to BLAST and FASTA.

To predict the functions of a possible protein product of any new or uncharacterized DNA sequence, it is important first to detect all significant similarities between the encoded amino acid sequence and any accumulated protein sequence data. We have implemented a set of queries and database sequences and proceeded to test and compare various similarity search methods and their parameterizations. We demonstrate here that the Smith-Waterman (S-W) dynamic programming method and the optimized version of FASTA are significantly better able to distinguish true similarities from statistical noise than is the popular database search tool BLAST. Also, a simple "log-length normalization" of S-W scores based on the query and target sequence lengths greatly increased the selectivity of the S-W searches, exceeding the default normalization method of FASTA. An implementation of the modified S-W algorithm in hardware (the Fast Data Finder) is able to match the accuracy of software versions while greatly speeding up its execution. We present here the selectivity and sensitivity data from these tests as well as results for various scoring matrices. We present data that will help users to choose threshold score values for evaluation of database search results. We also illustrate the impact of using simple-sequence masking tools such as SEG or XNU.