Mining Frequent Patterns with Wildcards from Biological Sequences

Frequent pattern mining from sequences is a crucial step for many domain experts, such as molecular biologists, who need to discover rules or patterns hidden in their data. To find specific patterns, many existing tools require users to specify gap constraints beforehand. In practice, it is often difficult for a user to provide such gap constraints. In addition, a change to the gap values may produce completely different results and require a separate, time-consuming re-mining procedure. It is therefore desirable to develop an algorithm that finds patterns automatically and efficiently without user-specified gap constraints. This paper presents and studies a frequent pattern mining problem without user-specified gap constraints: given a sequence and a support threshold, discover all subsequences whose support is not less than the threshold. These frequent subsequences are then used to form patterns. Two heuristic methods (one-way scan and two-way scan) are proposed to mine frequent subsequences and estimate the maximum support for both artificial and real-world data. For a given pattern, the simulation results show that the one-way scan heuristic performs better, estimating the maximum support with more than ninety percent accuracy.
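To make the notions of support and gap concrete, the following is a minimal sketch and not the paper's one-way or two-way scan heuristic, which the abstract does not specify. It assumes that support means the number of (possibly overlapping) occurrences in a single sequence and that a gap is a fixed number of wildcard positions between consecutive pattern characters. The function names `frequent_substrings` and `gapped_support` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def frequent_substrings(sequence, min_support, max_len=3):
    """Return {substring: count} for contiguous substrings of length
    1..max_len whose occurrence count is at least min_support.

    Assumption: support = number of (possibly overlapping) occurrences.
    """
    counts = Counter()
    for length in range(1, max_len + 1):
        for start in range(len(sequence) - length + 1):
            counts[sequence[start:start + length]] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def gapped_support(sequence, pattern, gap):
    """Count start positions where the characters of `pattern` occur with
    exactly `gap` wildcard positions between consecutive characters.

    Assumption: a fixed gap, supplied by the caller. This is the kind of
    user-specified constraint the abstract aims to avoid; it is shown
    here only to illustrate why changing the gap changes the result.
    """
    if not pattern:
        return 0
    step = gap + 1
    span = (len(pattern) - 1) * step + 1  # total window covered by the pattern
    return sum(
        all(sequence[i + k * step] == ch for k, ch in enumerate(pattern))
        for i in range(len(sequence) - span + 1)
    )


if __name__ == "__main__":
    seq = "ACGACGAC"
    print(frequent_substrings(seq, min_support=3, max_len=2))
    print(gapped_support(seq, "AA", gap=2))
    print(gapped_support(seq, "AA", gap=1))
```

On the toy sequence `ACGACGAC`, the pattern `AA` has support 2 with a gap of 2 but support 0 with a gap of 1, which illustrates how sensitive the result can be to the chosen gap value.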
