Paper Information - Protein Sequence Pattern Mining with Constraints

Biological sequence databases typically have a small alphabet, very long sequences and a relatively small size (several hundred sequences). Considering these characteristics, we propose a new sequence mining algorithm (gIL). gIL was developed for linear sequence pattern mining and results from the combination of some of the most efficient techniques used in sequence and itemset mining. The algorithm is highly adaptable, allowing a smooth and direct introduction of various types of features into the mining process, namely the extraction of rigid and arbitrary gap patterns. Both breadth-first and depth-first traversals are possible. The experimental evaluation, on synthetic and real-life protein databases, has shown that our algorithm has superior performance to state-of-the-art algorithms. The use of constraints has also proved to be a very useful tool for specifying the patterns that are of interest to the user.
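To make the distinction between rigid and arbitrary gap patterns concrete, the following is a minimal sketch that counts how many sequences in a small database contain a gapped pattern. It is not the gIL algorithm, and it does not mine patterns: it only checks a given pattern against each sequence using regular expressions. The function names `pattern_to_regex` and `support`, the toy database and the encoding of a gap as a `(min, max)` pair are all assumptions made for illustration. The sketch also assumes that a rigid gap has a fixed length (`min == max`) while an arbitrary gap may vary within a range.

```python
import re
from typing import List, Tuple

Gap = Tuple[int, int]  # (min, max) number of wildcard positions; hypothetical encoding


def pattern_to_regex(symbols: str, gaps: List[Gap]) -> "re.Pattern":
    """Build a regex for symbols separated by gaps of min..max arbitrary residues."""
    if not symbols:
        raise ValueError("pattern must contain at least one symbol")
    if len(gaps) != len(symbols) - 1:
        raise ValueError("need exactly one gap per pair of adjacent symbols")
    parts = [re.escape(symbols[0])]
    for sym, (lo, hi) in zip(symbols[1:], gaps):
        parts.append(".{%d,%d}" % (lo, hi))  # the gap: lo..hi wildcard positions
        parts.append(re.escape(sym))
    return re.compile("".join(parts))


def support(database: List[str], symbols: str, gaps: List[Gap]) -> int:
    """Count the sequences that contain at least one occurrence of the pattern."""
    regex = pattern_to_regex(symbols, gaps)
    return sum(1 for seq in database if regex.search(seq))


# Toy database of made-up sequences (not real proteins).
db = ["MKACLDE", "AGGCWD", "ACD", "KAXCYD"]

# Rigid gaps: A immediately followed by C, then exactly one residue, then D.
rigid = support(db, "ACD", [(0, 0), (1, 1)])      # 1

# Arbitrary gaps: 0 to 2 residues between A and C, 0 to 1 between C and D.
arbitrary = support(db, "ACD", [(0, 2), (0, 1)])  # 4
```

A pattern would be reported as frequent when its support reaches a user-chosen minimum. A real miner enumerates candidate patterns instead of testing a single one, which is where the choice between breadth-first and depth-first traversal matters.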

Paulo J. Azevedo | Pedro Gabriel Ferreira
