Fast identification of repetitive elements in biological sequences.
We have developed a fast filtering method for searching for repetitive sequences in databases that allows different families of repetitive elements to be identified simultaneously in a single scan. It discriminates between repetitive elements and non-related sequences by comparing the frequencies of k-words found in both groups of sequences. The distance used to sort the sequences is based on a weighting of the k-words, which is obtained by performing a correspondence analysis on learning sets of correctly chosen sequences. The identification of Alu elements in human sequences is given as an illustration of the method. The Alu sequences are divided into four distinct groups of elements: the left and right monomers located on the direct and on the complementary strands. The results obtained on the test sets show that very good discrimination is achieved with a word length of 6 bp. Indeed, only 0.5% of the non-Alu sequences were incorrectly predicted as Alu elements at a threshold value that allows the identification of all Alu monomers. Misclassification of Alu monomers among the four groups (1.4%) occurs only when the left and the right monomers are in the same orientation. Moreover, during the scanning of 63 GenBank sequences longer than 10 kb, all the Alu elements were correctly identified (616 elements) and only a few non-Alu sequences were wrongly predicted as Alu elements (22 fragments). There is a real need for this kind of method, since most repetitive elements are not annotated in the database entries. The method can therefore be used for systematic screening of new sequences before they are inserted into databases. It can also allow the creation of specific databases devoted to repetitive elements, which is a required step for any further analysis of those elements.
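The k-word comparison described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. In particular, the weighting step here is a deliberately simple stand-in (the difference between mean k-word frequencies in the two learning sets) and not a correspondence analysis. The function names `kword_frequencies`, `toy_weights` and `weighted_score` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def kword_frequencies(seq, k=6):
    """Relative frequencies of overlapping k-words in a sequence.

    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def mean_frequencies(seqs, k):
    """Average k-word frequency profile over a set of sequences."""
    total = Counter()
    for s in seqs:
        for w, f in kword_frequencies(s, k).items():
            total[w] += f
    return {w: v / len(seqs) for w, v in total.items()}


def toy_weights(positives, negatives, k=6):
    """ASSUMPTION: a toy weighting, NOT correspondence analysis.

    Each k-word is weighted by how much more frequent it is, on average,
    in the repetitive-element learning set than in the unrelated set.
    """
    pos = mean_frequencies(positives, k)
    neg = mean_frequencies(negatives, k)
    return {w: pos.get(w, 0.0) - neg.get(w, 0.0) for w in set(pos) | set(neg)}


def weighted_score(seq, weights, k=6):
    """Weighted sum of k-word frequencies; higher means more repeat-like."""
    freqs = kword_frequencies(seq, k)
    return sum(weights.get(w, 0.0) * f for w, f in freqs.items())
```

In such a sketch, a window of a database sequence would be flagged as a candidate element when its score exceeds a threshold chosen on the test sets. Handling the four monomer groups would require one weight table per group, with the complementary-strand groups built from reverse-complemented learning sequences.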