Super-sequence frequent pattern mining on sequential dataset

Because Frequent Pattern Mining (FPM) is important in bioinformatics, web mining, social networks, and other areas, researchers have paid significant attention to FPM and its various forms. In this study, we introduce a new form that we call super-sequence pattern mining. In contrast to frequent sub-sequence pattern mining, which has been studied extensively in the literature, frequent super-sequence mining requires identifying super-sequences that may contain sequential parts from different sequences and whose total support is larger than a given threshold. In essence, finding frequent super-sequence patterns turns out to be related to the well-known NP-hard longest path problem in graphs. Accordingly, we transform a given sequential dataset into a sequence graph and formulate the problem as a k-hop longest path problem. We then propose a heuristic algorithm using dynamic programming techniques. The running time of our solution depends on the number of distinct items in the sequence set but not on the size of the dataset. Through experiments, we demonstrate the effectiveness of the proposed solution. We also illustrate its use on an actual web log dataset and report some interesting findings based on the frequent super-sequences identified in it.
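To make the graph formulation concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes that the sequence graph has one vertex per distinct item, that an edge weight counts how often one item is immediately followed by another, and that the "total support" of a candidate is the sum of its edge weights; the abstract confirms none of these details. The function names `build_sequence_graph` and `best_k_hop_walk` are hypothetical. The dynamic program finds a maximum-weight walk with exactly k edges, a relaxation in which vertices may repeat; requiring a simple path is what makes the longest path problem NP-hard.

```python
from collections import defaultdict


def build_sequence_graph(sequences):
    """Assumed construction: weight[(a, b)] = number of times item a is
    immediately followed by item b anywhere in the dataset."""
    weights = defaultdict(int)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            weights[(a, b)] += 1
    return dict(weights)


def best_k_hop_walk(weights, k):
    """Return (total_weight, walk) for a maximum-weight walk with exactly
    k edges. Vertices may repeat, so this is a relaxation of the k-hop
    longest (simple) path problem, not an exact solver for it."""
    nodes = {node for edge in weights for node in edge}
    # best[v] = (total weight, walk) of the best walk ending at v
    # that uses exactly h edges; h starts at 0.
    best = {v: (0, [v]) for v in nodes}
    for _ in range(k):
        extended = {}
        for (a, b), w in weights.items():
            if a not in best:
                continue  # no walk of the current length ends at a
            total = best[a][0] + w
            if b not in extended or total > extended[b][0]:
                extended[b] = (total, best[a][1] + [b])
        best = extended
    if not best:
        return 0, []
    return max(best.values(), key=lambda entry: entry[0])


sequences = [["a", "b", "c"], ["b", "c", "d"], ["a", "b"], ["c", "d"]]
graph = build_sequence_graph(sequences)
total, walk = best_k_hop_walk(graph, 3)
print(total, walk)
```

In this toy dataset no single sequence contains a, b, c, d in full, yet the walk stitches that super-sequence together from parts of different sequences. Once the graph is built, the loop costs on the order of k times the number of edges, which is bounded by the square of the number of distinct items rather than by the number of sequences.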
