A Survey of Parallel Sequential Pattern Mining

With the growing popularity of shared resources, large volumes of complex data of different types are collected automatically. Traditional data mining algorithms generally face challenges such as huge memory cost, low processing speed, and inadequate hard disk space. As a fundamental task of data mining, sequential pattern mining (SPM) is used in a wide variety of real-life applications. However, it is more complex and challenging than other pattern mining tasks, e.g., frequent itemset mining and association rule mining, and it suffers from the same challenges when handling large-scale data. To solve these problems, mining sequential patterns in a parallel or distributed computing environment has emerged as an important issue with many applications. This article provides an in-depth survey of the current status of parallel SPM (PSPM), including a detailed categorization of traditional serial SPM approaches and state-of-the-art PSPM approaches. We review the related work on PSPM in detail, covering partition-based, apriori-based, pattern-growth-based, and hybrid algorithms, and describe the characteristics, advantages, and disadvantages of each class of parallel approaches, together with a summary. Some advanced topics for PSPM, including parallel quantitative/weighted/utility SPM, PSPM from uncertain data and stream data, and hardware acceleration for PSPM, are also reviewed in detail. In addition, we review some well-known open-source software for PSPM. Finally, we summarize some challenges and opportunities of PSPM in the big data era.
