Accelerating probabilistic frequent itemset mining: a model-based approach

Data uncertainty is inherent in emerging applications such as location-based services, sensor monitoring systems, and data integration. To handle a large amount of imprecise information, uncertain databases have been recently developed. In this paper, we study how to efficiently discover frequent itemsets from large uncertain databases, interpreted under the Possible World Semantics. This is technically challenging, since an uncertain database induces an exponential number of possible worlds. To tackle this problem, we propose a novel method to capture the itemset mining process as a Poisson binomial distribution. This model-based approach extracts frequent itemsets with a high degree of accuracy, and supports large databases. We apply our techniques to improve the performance of the algorithms for: (1) finding itemsets whose frequentness probabilities are larger than some threshold; and (2) mining itemsets with the k highest frequentness probabilities. Our approaches support both tuple and attribute uncertainty models, which are commonly used to represent uncertain databases. Extensive evaluation on real and synthetic datasets shows that our methods are highly accurate. Moreover, they are orders of magnitudes faster than previous approaches.

[1]  Hans-Peter Kriegel,et al.  Density-based clustering of uncertain data , 2005, KDD '05.

[2]  Philip S. Yu,et al.  Approximate Frequent Itemset Mining In the Presence of Random Noise , 2008, Soft Computing for Knowledge Discovery and Data Mining.

[3]  C. Stein Approximate computation of expectations , 1986 .

[4]  A. Prasad Sistla,et al.  Querying the Uncertain Position of Moving Objects , 1997, Temporal Databases, Dagstuhl.

[5]  Peter J. Haas,et al.  MCDB: a monte carlo approach to managing uncertain data , 2008, SIGMOD Conference.

[6]  Sau Dan Lee,et al.  Decision Trees for Uncertain Data , 2011, IEEE Transactions on Knowledge and Data Engineering.

[7]  Dan Suciu,et al.  Towards correcting input data errors probabilistically using integrity constraints , 2006, MobiDE '06.

[8]  Sunil Prabhakar,et al.  Evaluating probabilistic queries over imprecise data , 2003, SIGMOD '03.

[9]  Feifei Li,et al.  Finding frequent items in probabilistic data , 2008, SIGMOD Conference.

[10]  Edward Hung,et al.  Mining Frequent Itemsets from Uncertain Data , 2007, PAKDD.

[11]  Dan Olteanu,et al.  MayBMS: a probabilistic database management system , 2009, SIGMOD Conference.

[12]  Graham Cormode,et al.  Sketching probabilistic data streams , 2007, SIGMOD '07.

[13]  Charu C. Aggarwal,et al.  Frequent pattern mining with uncertain data , 2009, KDD.

[14]  Christian S. Jensen,et al.  Temporal Databases: Research and Practice , 1998, Lecture Notes in Computer Science.

[15]  Philip S. Yu,et al.  A Survey of Uncertain Data Algorithms and Applications , 2009, IEEE Transactions on Knowledge and Data Engineering.

[16]  Wei Hong,et al.  Model-Driven Data Acquisition in Sensor Networks , 2004, VLDB.

[17]  Reynold Cheng,et al.  Naive Bayes Classification of Uncertain Data , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[18]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[19]  Hans-Peter Kriegel,et al.  Probabilistic frequent itemset mining in uncertain databases , 2009, KDD.

[20]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[21]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[22]  Parag Agrawal,et al.  Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS (Demo) , 2007, CIDR.

[23]  Reynold Cheng,et al.  Mining uncertain data with probabilistic guarantees , 2010, KDD.

[24]  Man Hon Wong,et al.  Mining fuzzy association rules in databases , 1998, SGMD.

[25]  Anastasia Ailamaki,et al.  Challenges inbuilding a DBMS Resource Advisor , 2006, IEEE Data Eng. Bull..

[26]  Yufei Tao,et al.  Efficient Evaluation of Probabilistic Advanced Spatial Queries on Existentially Uncertain Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[27]  L. L. Cam,et al.  An approximation theorem for the Poisson binomial distribution. , 1960 .

[28]  Sriram Raghavan,et al.  Avatar Information Extraction System , 2006, IEEE Data Eng. Bull..

[29]  Wilfred Ng,et al.  Mining Vague Association Rules , 2007, DASFAA.