USING AN EXTENDED SUFFIX TREE TO SPEED UP SEQUENCE ALIGNMENT

An important problem in computational biology is aligning a given query sequence against the sequences in a database to find those that are similar to the query, either locally or globally. Many heuristic algorithms for this problem locate a fixed-length pair of matching substrings (called a seed) to start an alignment, and then extend this alignment using dynamic programming. We generalize this idea and take it one step further in a tool we have developed, the Sequence Comparison Tool (SCT). SCT preprocesses the database to build a special generalized suffix tree from its sequences. This tree extends the definition of a generalized suffix tree by storing at each node the length and frequency (number of occurrences) of the corresponding substring (pattern). A pattern is regarded as significant if it is sufficiently long and appears many times in the database. A significant pattern shared by two sequences indicates that the sequences are locally similar. SCT ranks the database sequences by the number of significant patterns they share with the query sequence, reduces the database to a given number of top-ranked sequences, and then invokes an ordinary local alignment algorithm on this reduced database. We conducted experiments on real biological sequences and compared SCT's performance with that of BLAST, a popular alignment tool, using the 6-fold cross-validation technique from data mining. The tests show that SCT effectively reduces the database and obtains results very similar to those of BLAST in approximately half the time taken by BLAST.
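The ranking and reduction steps described above can be illustrated with a minimal Python sketch. It is not SCT's implementation: a plain dictionary of substring counts stands in for the extended generalized suffix tree, and the function names, the thresholds (min_len, max_len, min_freq, top_k), and the toy sequences are all assumptions introduced for illustration. The upper bound max_len exists only to keep the naive enumeration small. The final local alignment on the reduced database is not shown.

```python
from collections import Counter


def pattern_frequencies(database, min_len, max_len):
    """Count every occurrence of each substring whose length lies in
    [min_len, max_len], across all database sequences.
    (Stand-in for the per-node frequency stored in the extended suffix tree.)"""
    freq = Counter()
    for seq in database.values():
        for length in range(min_len, max_len + 1):
            for start in range(len(seq) - length + 1):
                freq[seq[start:start + length]] += 1
    return freq


def distinct_substrings(seq, min_len, max_len):
    """Return the set of distinct substrings of seq with length in [min_len, max_len]."""
    return {
        seq[start:start + length]
        for length in range(min_len, max_len + 1)
        for start in range(len(seq) - length + 1)
    }


def rank_database(query, database, min_len=3, max_len=4, min_freq=2, top_k=2):
    """Score each database sequence by the number of significant patterns it
    shares with the query, then keep the top_k sequences (the reduced database)."""
    freq = pattern_frequencies(database, min_len, max_len)
    # Assumed significance rule: long enough (>= min_len) and frequent enough.
    significant = {p for p, n in freq.items() if n >= min_freq}
    query_patterns = distinct_substrings(query, min_len, max_len) & significant
    scores = {
        name: sum(1 for p in query_patterns if p in seq)
        for name, seq in database.items()
    }
    # Sort by descending score; ties are broken by name for determinism.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:top_k], scores


# Hypothetical toy database and query, for illustration only.
database = {"s1": "ACGTACGT", "s2": "TTACGTTT", "s3": "GGGGCCCC"}
reduced, scores = rank_database("ACGTA", database)
print(reduced, scores)
```

In this toy run, s3 shares no significant pattern with the query and is dropped from the reduced database. A real suffix tree would provide the same length and frequency information without enumerating substrings explicitly.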