Summarizing probabilistic frequent patterns: a fast approach

Mining probabilistic frequent patterns from uncertain data has received a great deal of attention in recent years because of its wide range of applications. However, probabilistic frequent pattern mining generates an exponential number of result patterns, which seriously hinders further evaluation and analysis. In this paper, we focus on the problem of mining probabilistic representative frequent patterns (P-RFP), the minimal set of patterns that represents all frequent patterns with sufficiently high probability. The bottleneck lies in checking whether one pattern can probabilistically represent another, which requires computing a joint probability of the supports of the two patterns. We introduce a novel approximation of this joint probability, supported by both theoretical analysis and empirical evidence. Based on the approximation, we propose an Approximate P-RFP Mining (APM) algorithm, which effectively and efficiently compresses the set of probabilistic frequent patterns. To our knowledge, this is the first attempt to analyze the relationship between two probabilistic frequent patterns through an approximate approach. Our experiments on both synthetic and real-world datasets demonstrate that the APM algorithm accelerates P-RFP mining dramatically, running orders of magnitude faster than an exact solution. Moreover, the error rate of APM is guaranteed to be very small when the database contains hundreds of transactions, which further affirms that APM is a practical solution for summarizing probabilistic frequent patterns.
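To illustrate why an approximation can be accurate once a database holds hundreds of transactions, the following minimal sketch compares an exact and an approximate computation of the probability that a single pattern is frequent. It is not the APM algorithm and does not compute the joint probability of two supports; it only covers the simpler single-pattern case. It assumes a tuple-uncertainty model in which the pattern occurs in each transaction independently with a known probability, and the function names (`support_pmf`, `frequentness_exact`, `frequentness_normal`) are hypothetical.

```python
import math


def support_pmf(probs):
    """Exact distribution of the support: pmf[k] = P(support == k).

    probs[i] is the (assumed independent) probability that the pattern
    occurs in transaction i. Runs in O(n^2) time for n transactions.
    """
    pmf = [1.0]  # with zero transactions, the support is 0 with certainty
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # pattern absent from this transaction
            nxt[k + 1] += mass * p       # pattern present in this transaction
        pmf = nxt
    return pmf


def frequentness_exact(probs, min_sup):
    """P(support >= min_sup), computed from the exact distribution."""
    return sum(support_pmf(probs)[min_sup:])


def frequentness_normal(probs, min_sup):
    """Normal approximation of P(support >= min_sup), with continuity correction.

    Runs in O(n) time; accuracy improves as the number of transactions grows.
    """
    mu = sum(probs)
    var = sum(p * (1.0 - p) for p in probs)
    if var == 0.0:
        # Every transaction is certain, so the support is deterministic.
        return 1.0 if mu >= min_sup else 0.0
    z = (min_sup - 0.5 - mu) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # equals 1 - Phi(z)


if __name__ == "__main__":
    # Illustrative, made-up occurrence probabilities for 300 transactions.
    probs = [0.2 + 0.6 * ((i * 37) % 100) / 100 for i in range(300)]
    print("exact :", frequentness_exact(probs, 150))
    print("normal:", frequentness_normal(probs, 150))
```

The exact routine needs quadratic time per pattern, while the approximation needs a single pass over the transactions. This gap is the kind of saving an approximate approach targets, although the pairwise check described above is more involved than this single-pattern case.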
