Distributed Sequential Pattern Mining in Large Scale Uncertain Databases

Sequential pattern mining (SPM) is an important application in uncertain databases, but it is challenging in terms of efficiency and scalability. In this paper, we develop a dynamic programming (DP) approach to mine probabilistic frequent sequential patterns on the distributed computing platform Spark. Directly applying the DP method on Spark is impractical because its high memory consumption can cause heavy JVM garbage collection overhead. We therefore design a memory-efficient distributed DP approach and use an extended prefix-tree to store intermediate results efficiently. Extensive experiments at various scales show that our method is orders of magnitude faster than straightforward approaches.
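To make the DP idea concrete, here is a minimal single-machine sketch of the standard dynamic program for the frequentness probability used in probabilistic frequent pattern mining. It is not the distributed algorithm of this paper. It assumes that each sequence contains the pattern independently with a known probability; the names `frequentness_probability`, `support_probs`, and `min_sup` are hypothetical and chosen only for illustration.

```python
def frequentness_probability(support_probs, min_sup):
    """Return P(support >= min_sup) for one pattern.

    Assumption: sequence i contains the pattern with probability
    support_probs[i], independently of all other sequences.
    """
    if min_sup <= 0:
        return 1.0
    # dp[j] for j < min_sup: probability that exactly j sequences seen so far
    # contain the pattern. dp[min_sup] is an absorbing state holding the
    # probability that at least min_sup sequences contain it.
    dp = [0.0] * (min_sup + 1)
    dp[0] = 1.0
    for p in support_probs:
        # Update from high to low so each dp[j - 1] read is the old value.
        dp[min_sup] += dp[min_sup - 1] * p
        for j in range(min_sup - 1, 0, -1):
            dp[j] = dp[j] * (1.0 - p) + dp[j - 1] * p
        dp[0] *= 1.0 - p
    return dp[min_sup]


if __name__ == "__main__":
    # Two sequences, each containing the pattern with probability 0.5.
    print(frequentness_probability([0.5, 0.5], 1))
    print(frequentness_probability([0.5, 0.5], 2))
```

The sketch keeps a single rolling array of length `min_sup + 1` instead of a full table, so memory per pattern does not grow with the number of sequences. A pattern would then be reported as probabilistically frequent when the returned value reaches a user-chosen probability threshold.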
