An Adaptive Suffix Tree Based Algorithm for Repeats Identification in a DNA Sequence

Many existing methods for repeats identification are based on alignments. Their running time significantly limits their applications. This paper presents the fast Rep(eats)Seeker algorithm for repeats identification, based on an adaptive version of the Ukkonen suffix tree construction algorithm. The RepSeeker algorithm uses the lowest frequency limit to maximize the extension of repeats. Adaptive improvements are made to the Ukkonen algorithm to increase the efficiency of the RepSeeker algorithm: the node information required by the RepSeeker algorithm is added during suffix tree construction. Because the information stored on leaves differs from that stored on branch nodes, the RepSeeker algorithm obtains the needed information directly from the nodes to find the frequency and locate the positions of a substring. The improvement for repeats identification is considerable and comes at little extra cost in space. Nine sequences from the National Center for Biotechnology Information (NCBI) are used to test the performance of the RepSeeker algorithm. Comparisons of the suffix tree construction before and after the improvements show that the running time of the RepSeeker algorithm is reduced without loss of accuracy. The experimental results agree with the theoretical expectations.
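The idea of annotating nodes during construction, so that the frequency and positions of a substring can be read from the tree, can be illustrated with a minimal sketch. This is not the RepSeeker implementation. For brevity it builds an uncompressed suffix trie by naive quadratic insertion instead of using the Ukkonen algorithm, and the names `Node`, `build_suffix_trie`, `frequency`, and `positions` are hypothetical. It assumes that a branch node stores a count of the suffixes passing through it, and that a leaf stores the start position of its suffix.

```python
class Node:
    """Trie node; the fields are illustrative assumptions, not the paper's layout."""
    __slots__ = ("children", "count", "start")

    def __init__(self):
        self.children = {}   # next character -> child node
        self.count = 0       # number of suffixes whose path passes through this node
        self.start = None    # suffix start index; set only on leaves


def build_suffix_trie(text):
    """Naive O(n^2) construction that fills in node information as it goes."""
    text += "$"  # unique terminator so every suffix ends at its own leaf
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
        node.start = i
    return root


def find_node(root, pattern):
    """Walk down from the root along pattern; return None if it does not occur."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def frequency(root, pattern):
    """Read the occurrence count directly from the node reached by pattern."""
    node = find_node(root, pattern)
    return node.count if node is not None else 0


def positions(root, pattern):
    """Collect the start indices stored on the leaves below the pattern's node."""
    node = find_node(root, pattern)
    if node is None:
        return []
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.start is not None:
            found.append(current.start)
        stack.extend(current.children.values())
    return sorted(found)


root = build_suffix_trie("ACGTACGTAC")
print(frequency(root, "AC"))   # 3
print(positions(root, "AC"))   # [0, 4, 8]
```

In this sketch the frequency query costs time proportional to the pattern length, because the count is already stored on the node. The position query still has to visit the leaves below that node.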