Beyond Itemsets: Mining Frequent Featuresets over Structured Items

We assume a dataset of transactions generated by a set of users over structured items where each item could be described through a set of features. In this paper, we are interested in identifying the frequent featuresets (set of features) by mining item transactions. For example, in a news website, items correspond to news articles, the features are the named-entities/topics in the articles and an item transaction would be the set of news articles read by a user within the same session. We show that mining frequent featuresets over structured item transactions is a novel problem and show that straightforward extensions of existing frequent itemset mining techniques provide unsatisfactory results. This is due to the fact that while users are drawn to each item in the transaction due to a subset of its features, the transaction by itself does not provide any information about such underlying preferred features of users. In order to overcome this hurdle, we propose a featureset uncertainty model where each item transaction could have been generated by various featuresets with different probabilities. We describe a novel approach to transform item transactions into uncertain transaction over featuresets and estimate their probabilities using constrained least squares based approach. We propose diverse algorithms to mine frequent featuresets. Our experimental evaluation provides a comparative analysis of the different approaches proposed.

[1]  Vipin Kumar,et al.  Quantitative evaluation of approximate frequent pattern mining algorithms , 2008, KDD.

[2]  Charu C. Aggarwal,et al.  Frequent pattern mining with uncertain data , 2009, KDD.

[3]  Lise Getoor,et al.  Representing Tuple and Attribute Uncertainty in Probabilistic Databases , 2007, Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007).

[4]  Georgia Koutrika,et al.  On the selection of tags for tag clouds , 2011, WSDM '11.

[5]  J. Navarro-Pedreño Numerical Methods for Least Squares Problems , 1996 .

[6]  Hans-Peter Kriegel,et al.  Probabilistic frequent itemset mining in uncertain databases , 2009, KDD.

[7]  Raymond Reiter,et al.  A Theory of Diagnosis from First Principles , 1986, Artif. Intell..

[8]  Dan Olteanu,et al.  Fast and Simple Relational Processing of Uncertain Data , 2007, 2008 IEEE 24th International Conference on Data Engineering.

[9]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[10]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[11]  Feifei Li,et al.  Finding frequent items in probabilistic data , 2008, SIGMOD Conference.

[12]  Philip S. Yu,et al.  Mining Frequent Itemsets over Uncertain Databases , 2012, Proc. VLDB Endow..

[13]  Philip S. Yu,et al.  A Survey of Uncertain Data Algorithms and Applications , 2009, IEEE Transactions on Knowledge and Data Engineering.

[14]  Peter Damaschke,et al.  The union of minimal hitting sets: Parameterized combinatorial bounds and counting , 2007, J. Discrete Algorithms.

[15]  Rakesh Agarwal,et al.  Fast Algorithms for Mining Association Rules , 1994, VLDB 1994.

[16]  Edward Hung,et al.  Mining Frequent Itemsets from Uncertain Data , 2007, PAKDD.

[17]  Reynold Cheng,et al.  Mining uncertain data with probabilistic guarantees , 2010, KDD.

[18]  Thomas Hofmann,et al.  Learning What People (Don't) Want , 2001, ECML.

[19]  Jennifer Widom,et al.  ULDBs: databases with uncertainty and lineage , 2006, VLDB.

[20]  Hannu Toivonen,et al.  Sampling Large Databases for Association Rules , 1996, VLDB.

[21]  Carson Kai-Sang Leung,et al.  A Tree-Based Approach for Frequent Pattern Mining from Uncertain Data , 2008, PAKDD.

[22]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[23]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.