Probabilistic frequent itemset mining over uncertain data streams

Abstract: This paper considers the problem of mining probabilistic frequent itemsets in the sliding window of an uncertain data stream. We design an in-memory index, named PFIT, to store the data synopsis so that the current probabilistic frequent itemsets can be output in real time. We also propose a depth-first algorithm, PFIMoS, which builds the PFIT bottom-up and maintains it dynamically. Because computing the probabilistic support is time-consuming, we propose a method that estimates the range of the probabilistic support from the support and the expected support, which greatly reduces runtime and memory usage. Nevertheless, when the minimum support is low and the data are dense, a large number of probabilistic supports still have to be computed, which can slow mining drastically. We address this problem with a heuristic rule-based algorithm, PFIMoS+, which introduces an error parameter to reduce the number of probabilistic support computations. Theoretical analysis and experimental studies demonstrate that the proposed algorithms efficiently reduce computing time and memory, ensure fast and exact mining of probabilistic data streams, and markedly outperform the state-of-the-art algorithms TODIS-Stream (Sun et al., 2010) and FEMP (Akbarinia & Masseglia, 2013).
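To make the cost argument concrete, the following is a minimal Python sketch, not the paper's PFIT or PFIMoS code. It assumes the common definition in which the probabilistic support of an itemset under a probability threshold tau is the largest s such that P(support >= s) >= tau, and it assumes that transactions in the window are independent. The function names (`support_distribution`, `probabilistic_support`, `cheap_statistics`) and the example probabilities are hypothetical. The exact computation uses a dynamic program that is quadratic in the window size, whereas the support and the expected support each take a single pass; the only bound shown here is the elementary one that the probabilistic support cannot exceed the support, since the abstract does not state the paper's actual bounds.

```python
from typing import List, Tuple


def support_distribution(probs: List[float]) -> List[float]:
    """Return dist where dist[k] = P(exactly k transactions contain the itemset).

    probs[i] is the (assumed independent) probability that transaction i
    in the window contains the itemset. Runs in O(n^2) time.
    """
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # transaction does not contain the itemset
            nxt[k + 1] += mass * p       # transaction contains the itemset
        dist = nxt
    return dist


def probabilistic_support(probs: List[float], tau: float) -> int:
    """Largest s such that P(support >= s) >= tau (assumed definition)."""
    dist = support_distribution(probs)
    tail = 0.0
    for s in range(len(dist) - 1, -1, -1):
        tail += dist[s]                  # tail is now P(support >= s)
        if tail >= tau:
            return s
    return 0


def cheap_statistics(probs: List[float]) -> Tuple[int, float]:
    """Support (transactions with non-zero probability) and expected support, in O(n)."""
    support = sum(1 for p in probs if p > 0.0)
    expected_support = sum(probs)
    return support, expected_support


if __name__ == "__main__":
    window = [0.9, 0.8, 0.0, 0.5]        # hypothetical per-transaction probabilities
    support, expected = cheap_statistics(window)
    exact = probabilistic_support(window, tau=0.7)
    print(support, expected, exact)
    assert exact <= support              # elementary upper bound
```

If a cheap range derived from the support and the expected support already places an itemset clearly above or below the minimum support, the quadratic computation can be skipped for that itemset, which is the kind of saving the abstract describes.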

[1] Ge Yu, et al. An Efficient Method for Cleaning Dirty-Events over Uncertain Data in WSNs, 2011, Journal of Computer Science and Technology.

[2] Reynold Cheng, et al. Mining uncertain data with probabilistic guarantees, 2010, KDD.

[3] Jennifer Widom, et al. Models and issues in data stream systems, 2002, PODS.

[4] Lei Chen, et al. Mining Frequent Itemsets in Correlated Uncertain Databases, 2015, Journal of Computer Science and Technology.

[5] Jeffrey Xu Yu, et al. Sliding-window top-k queries on uncertain streams, 2008, Proc. VLDB Endow.

[6] Hong Chen, et al. FARP: Mining fuzzy association rules from a probabilistic quantitative database, 2013, Inf. Sci.

[7] Toon Calders, et al. Approximation of Frequentness Probability of Itemsets in Uncertain Data, 2010, 2010 IEEE International Conference on Data Mining.

[8] Ying-Ho Liu, et al. Mining frequent patterns from univariate uncertain data, 2012, Data Knowl. Eng.

[9] Reza Akbarinia, et al. Fast and Exact Mining of Probabilistic Data Streams, 2013, ECML/PKDD.

[10] Carson Kai-Sang Leung, et al. Efficient algorithms for mining constrained frequent patterns from uncertain data, 2009, U '09.

[11] Carson Kai-Sang Leung, et al. BLIMP: A Compact Tree Structure for Uncertain Frequent Pattern Mining, 2014, DaWaK.

[12] Hans-Peter Kriegel, et al. Probabilistic Frequent Pattern Growth for Itemset Mining in Uncertain Databases, 2010, SSDBM.

[13] Tzung-Pei Hong, et al. Efficiently mining uncertain high-utility itemsets, 2017, Soft Comput.

[14] Carson Kai-Sang Leung, et al. Frequent Pattern Mining from Time-Fading Streams of Uncertain Data, 2011, DaWaK.

[15] Edward Hung, et al. Mining Frequent Itemsets from Uncertain Data, 2007, PAKDD.

[16] Wilfred Ng, et al. A survey on algorithms for mining frequent itemsets over data streams, 2008, Knowledge and Information Systems.

[17] Charu C. Aggarwal, et al. Frequent pattern mining with uncertain data, 2009, KDD.

[18] Peiyi Tang, et al. Mining probabilistic frequent closed itemsets in uncertain databases, 2011, ACM-SE '11.

[19] Feifei Li, et al. Finding frequent items in probabilistic data, 2008, SIGMOD Conference.

[20] Philip S. Yu, et al. Mining Frequent Itemsets over Uncertain Databases, 2012, Proc. VLDB Endow.

[21] Toon Calders, et al. Efficient Pattern Mining of Uncertain Data with Sampling, 2010, PAKDD.

[22] Zhang Xiaolin, et al. Mining of Probabilistic Frequent Itemsets over Uncertain Data Streams, 2014, 2014 11th Web Information System and Application Conference.

[23] Reynold Cheng, et al. Efficient Mining of Frequent Item Sets on Large Uncertain Databases, 2012, IEEE Transactions on Knowledge and Data Engineering.

[24] Carson Kai-Sang Leung, et al. Mining of Frequent Itemsets from Streams of Uncertain Data, 2009, 2009 IEEE 25th International Conference on Data Engineering.

[25] Lei Chen, et al. Discovering Threshold-based Frequent Closed Itemsets over Probabilistic Data, 2012, 2012 IEEE 28th International Conference on Data Engineering.

[26] Alfredo Cuzzocrea, et al. Computing Theoretically-Sound Upper Bounds to Expected Support for Frequent Pattern Mining Problems over Uncertain Big Data, 2016, IPMU.

[27] Hans-Peter Kriegel, et al. Probabilistic frequent itemset mining in uncertain databases, 2009, KDD.

[28] Peiyi Tang, et al. Fast approximation of probabilistic frequent closed itemsets, 2012, ACM-SE '12.

[29] Themis Palpanas, et al. Top-k Nearest Neighbor Search In Uncertain Data Series, 2014, Proc. VLDB Endow.

[30] Ben Kao, et al. A Decremental Approach for Mining Frequent Itemsets from Uncertain Data, 2008, PAKDD.

[31] Alfredo Cuzzocrea, et al. Discovering Frequent Patterns from Uncertain Data Streams with Time-Fading and Landmark Models, 2013, Trans. Large Scale Data Knowl. Centered Syst.

[32] Chengqi Zhang, et al. Summarizing probabilistic frequent patterns: a fast approach, 2013, KDD.