Mining Non-overlapping Repetitive Sequential Patterns by Improving GSP Algorithm

Repetitive sequential pattern (RSP) mining plays an important role and has been widely studied for DNA and genome data, but only a few approaches focus on mining RSPs from sequence databases. Taking the sequence <bcbcbcbca> as an example, traditional sequential pattern mining algorithms count <bc> only once when calculating its support, even though <bc> appears at least 4 times within this one data sequence. Accordingly, to capture more interesting sequential patterns, repetition needs to be taken into account during the mining process. However, most existing RSP methods target DNA analysis and cannot be used to recognize repetitive patterns in event sequences. Therefore, we propose an approach to determine the number of times a sequence repeats within a given data sequence. Because a sequence may repeat within one data sequence, its support value could exceed 100%; we therefore propose a strategy that keeps the support of a repetitive sequence within [0, 100%]. Finally, we propose an efficient algorithm, called RptGSP, which improves the GSP algorithm to discover such repetitive sequential patterns. The experimental results show that RptGSP can efficiently discover repetitive patterns.
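To make the <bcbcbcbca> example concrete, the following is a minimal sketch of counting non-overlapping repetitions of a pattern in one data sequence. It is not the RptGSP algorithm: the function names `count_nonoverlapping` and `repetitive_support` are hypothetical, the greedy left-to-right matching is one simple way to count, and the normalization (dividing by the largest number of occurrences that could fit in each data sequence) is only an assumed stand-in for the paper's strategy, which the abstract does not specify.

```python
from typing import Sequence


def count_nonoverlapping(pattern: Sequence[str], data: Sequence[str]) -> int:
    """Greedily count complete, non-overlapping occurrences of `pattern`
    as a (possibly gapped) subsequence of `data`, scanning left to right.
    Each event in `data` is used by at most one occurrence."""
    if not pattern:
        return 0
    count = 0
    pos = 0  # index of the next pattern item we are waiting for
    for event in data:
        if event == pattern[pos]:
            pos += 1
            if pos == len(pattern):  # one full occurrence completed
                count += 1
                pos = 0
    return count


def repetitive_support(pattern: Sequence[str],
                       database: Sequence[Sequence[str]]) -> float:
    """ASSUMED normalization (not taken from the paper): total repetitions
    divided by the maximum number of non-overlapping occurrences that could
    fit, so the result always lies in [0, 1]."""
    if not pattern:
        return 0.0
    total = sum(count_nonoverlapping(pattern, seq) for seq in database)
    capacity = sum(len(seq) // len(pattern) for seq in database)
    return total / capacity if capacity else 0.0


if __name__ == "__main__":
    print(count_nonoverlapping("bc", "bcbcbcbca"))
    print(repetitive_support("bc", ["bcbcbcbca", "aaa"]))
```

A traditional support count would credit <bc> once for the sequence <bcbcbcbca>, whereas the sketch credits it once per non-overlapping repetition. Dividing by a per-sequence capacity is just one way to keep the value bounded.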
