Approximate String Matching for Searching DNA Sequences

This paper presents a new algorithm for searching for short sequence fragments in long DNA sequences. A short sequence (the pattern) is searched for in both DNA strands, allowing up to a given maximum number of errors. Each DNA sequence (T) is preprocessed by compressing it with the Burrows-Wheeler transform and a wavelet tree. The pattern is first divided into short overlapping words, whose positions in T are determined using the FM-index. Connections between the words are then searched for, subject to the maximum allowed error. Experimental results indicate that the algorithm is highly effective and that it outperforms the popular Basic Local Alignment Search Tool (BLAST) when searching for short sequences.
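
The pipeline outlined above can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this paper, and it rests on several simplifying assumptions: the suffix array is built naively, a plain occurrence table stands in for the wavelet tree, errors are restricted to substitutions, and the step that connects words is replaced by a simple Hamming-distance check of each candidate start position. All names (`FMIndex`, `locate`, `overlapping_words`, `approx_search`) and the parameters `k` and `step` are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the pattern as it would read on the opposite DNA strand."""
    return seq.translate(COMPLEMENT)[::-1]


class FMIndex:
    """Toy FM-index over T + '$' (assumption: T contains only A, C, G, T)."""

    def __init__(self, text):
        text += "$"
        # Naive suffix array; real indexes use linear-time construction.
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])
        # BWT: the character preceding each sorted suffix.
        self.bwt = "".join(text[i - 1] for i in self.sa)
        counts = Counter(text)
        # C[c]: number of characters in the text strictly smaller than c.
        self.C, total = {}, 0
        for c in sorted(counts):
            self.C[c] = total
            total += counts[c]
        # occ[c][i]: occurrences of c in bwt[:i]. This plain table stands in
        # for the rank queries that a wavelet tree would answer.
        self.occ = {c: [0] * (len(text) + 1) for c in counts}
        for i, ch in enumerate(self.bwt):
            for c in counts:
                self.occ[c][i + 1] = self.occ[c][i] + (ch == c)

    def locate(self, word):
        """Backward search: sorted positions of exact occurrences of word."""
        lo, hi = 0, len(self.bwt)
        for c in reversed(word):
            if c not in self.C:
                return []
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return []
        return sorted(self.sa[lo:hi])


def overlapping_words(pattern, k, step):
    """Split the pattern into (offset, word) pairs; words overlap if step < k."""
    return [(i, pattern[i:i + k]) for i in range(0, len(pattern) - k + 1, step)]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def approx_search(text, pattern, max_err, k=4, step=2):
    """Return sorted (start, strand) pairs with at most max_err substitutions."""
    fm = FMIndex(text)
    hits = set()
    for strand, p in (("+", pattern), ("-", reverse_complement(pattern))):
        # Each exact word hit proposes a start position for the whole pattern.
        starts = {pos - off
                  for off, word in overlapping_words(p, k, step)
                  for pos in fm.locate(word)}
        # Stand-in for the word-connection step: verify each candidate.
        for s in starts:
            if 0 <= s <= len(text) - len(p) and hamming(text[s:s + len(p)], p) <= max_err:
                hits.add((s, strand))
    return sorted(hits)


if __name__ == "__main__":
    T = "TTACGGATCCAGTACGTT"
    print(approx_search(T, "ACGGTTCC", max_err=1))
    print(approx_search(T, "AACGTACT", max_err=0))
```

In this sketch a candidate is found only if at least one word matches exactly, so `k` and `step` must be chosen small enough relative to the pattern length and the error bound for that to be guaranteed. The second call searches for a pattern that occurs only on the reverse strand of the toy sequence.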