Efficient Algorithms for Mining Maximal Frequent Concatenate Sequences in Biological Datasets

The growth of bioinformatics has produced datasets with new characteristics. DNA sequences typically contain a large number of items, and biologists assemble the whole genome of a species from frequent concatenate sequences, which ordinarily span hundreds of items. Such datasets pose a great challenge for existing frequent pattern discovery algorithms. Almost all of these algorithms are Apriori-like and therefore have an exponential dependence on the average sequence length. PrefixSpan, which introduced the projection-based sequential pattern-growth approach, is the most efficient of them. However, it grows sequential patterns by exploring length-1 frequent patterns, and so it is not suitable for biological datasets with long frequent concatenate sequences. In this paper, we propose two novel algorithms, MacosFSpan and MacosVSpan, to mine maximal frequent concatenate sequences. Both are specifically designed to handle datasets with long frequent concatenate sequences. Our performance study shows that MacosFSpan outperforms the traditional methods based on length-1 sequence exploration, and that MacosVSpan is more efficient than MacosFSpan.
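To make the mining target concrete, the following is a minimal sketch of a naive baseline, not MacosFSpan or MacosVSpan. It assumes that a "concatenate sequence" is a contiguous run of items (a substring), that support is the number of input sequences containing it, and that a frequent substring is maximal when no longer frequent substring contains it. The function names and the toy DNA strings are hypothetical. The sketch grows candidates one item at a time, which is exactly the length-1 exploration that the abstract describes as costly for long patterns.

```python
from collections import defaultdict


def frequent_substrings(sequences, min_support):
    """Return {substring: support} for every contiguous substring that
    occurs in at least `min_support` of the input sequences.

    Candidates are grown level by level: a length-k substring is counted
    only if its length-(k-1) prefix was already found to be frequent.
    """
    frequent = {}
    k = 1
    while True:
        occurrences = defaultdict(set)  # substring -> ids of sequences containing it
        for seq_id, seq in enumerate(sequences):
            for start in range(len(seq) - k + 1):
                sub = seq[start:start + k]
                # Anti-monotone pruning: skip if the prefix is infrequent.
                if k == 1 or sub[:-1] in frequent:
                    occurrences[sub].add(seq_id)
        level = {s: len(ids) for s, ids in occurrences.items()
                 if len(ids) >= min_support}
        if not level:
            break
        frequent.update(level)
        k += 1
    return frequent


def maximal_frequent_substrings(sequences, min_support):
    """Return the frequent substrings not contained in a longer frequent one."""
    frequent = frequent_substrings(sequences, min_support)
    # Every substring of a frequent substring is itself frequent, so a
    # pattern is non-maximal exactly when a one-item extension on the
    # left or right is frequent.
    covered = set()
    for s in frequent:
        if len(s) > 1:
            covered.add(s[:-1])
            covered.add(s[1:])
    return sorted(s for s in frequent if s not in covered)


if __name__ == "__main__":
    reads = ["ACGTAC", "CGTA", "TTCGTG"]  # toy data, not from the paper
    print(maximal_frequent_substrings(reads, min_support=2))
```

Because each level extends patterns by a single item, a maximal pattern of length n takes n passes over the data to reach, which illustrates why this style of search scales poorly when frequent concatenate sequences run to hundreds of items.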
