Mining Frequent Itemsets over Uncertain Databases

In recent years, the wide use of uncertain data has drawn much attention to mining frequent itemsets over uncertain databases. In an uncertain database, the support of an itemset is a random variable rather than a fixed occurrence count. Thus, unlike in deterministic databases, where a frequent itemset has a unique definition, a frequent itemset in an uncertain environment has so far been defined in two different ways. The first definition, referred to as the expected support-based frequent itemset, uses the expectation of an itemset's support to decide whether the itemset is frequent. The second definition, referred to as the probabilistic frequent itemset, uses the probability that the support of an itemset reaches a given minimum support. Existing work on mining frequent itemsets over uncertain databases is consequently divided into two separate groups, and no study has comprehensively compared the two definitions. In addition, because no uniform experimental platform exists, current solutions for the same definition even report inconsistent results. In this paper, we first clarify the relationship between the two definitions. Through extensive experiments, we verify that the two definitions are tightly connected and can be unified when the data set is large enough. Second, we provide baseline implementations of eight existing representative algorithms and test their performance fairly under uniform measures. Finally, based on these fair tests over many benchmark data sets, we clarify several existing inconsistent conclusions and discuss some new findings.
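To make the two definitions concrete, the following minimal sketch computes both measures on a toy uncertain database. It is an illustration under stated assumptions, not one of the eight algorithms evaluated in the paper: each transaction is assumed to map an item to its existence probability, items and transactions are assumed independent, and the names `existence_probs`, `expected_support`, `frequent_probability`, and the sample `db` are hypothetical.

```python
from math import prod

# Toy uncertain database (hypothetical data): each transaction maps an
# item to the probability that the item is present in that transaction.
db = [
    {"a": 0.8, "b": 0.5},
    {"a": 0.6, "b": 0.9},
    {"a": 0.9},
    {"a": 0.4, "b": 0.7},
]

def existence_probs(db, itemset):
    """Probability that each transaction contains every item of `itemset`,
    assuming items occur independently within a transaction."""
    return [prod(t.get(item, 0.0) for item in itemset) for t in db]

def expected_support(db, itemset):
    """Expectation of the support: the sum of per-transaction probabilities."""
    return sum(existence_probs(db, itemset))

def frequent_probability(db, itemset, min_count):
    """Pr[support >= min_count], computed exactly by dynamic programming
    over the support distribution (a Poisson binomial distribution)."""
    # dist[k] = probability that exactly k of the transactions processed
    # so far contain the itemset.
    dist = [1.0]
    for p in existence_probs(db, itemset):
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # itemset absent from this transaction
            nxt[k + 1] += mass * p       # itemset present in this transaction
        dist = nxt
    return sum(dist[min_count:])

esup = expected_support(db, ("a", "b"))
fprob = frequent_probability(db, ("a", "b"), min_count=2)
```

Under the first definition, the itemset would be reported as frequent when `esup` reaches a minimum expected support; under the second, when `fprob` exceeds a user-chosen probability threshold. The exact dynamic program is quadratic in the number of transactions, which is one reason approximations of the support distribution are attractive on large data.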
