Approximating a collection of frequent sets

One of the most well-studied problems in data mining is computing the collection of frequent item sets in large transactional databases. One obstacle to the applicability of frequent-set mining is that the output collection can be far too large for users to examine and understand carefully. Even restricting the output to the border of the frequent item-set collection does little to alleviate the problem. In this paper we address the issue of overwhelmingly large output size by introducing and studying the following problem: What are the k sets that best approximate a collection of frequent item sets? We measure how well k sets approximate a collection by the size of the part of the collection they cover, i.e., the number of sets in the collection that are included in at least one of the k sets. We also specify a bound on the number of extra sets that are allowed to be covered. We examine several variants of the problem, demonstrate the hardness of each, and provide simple polynomial-time approximation algorithms. We give empirical evidence showing that the approximation methods work well in practice.
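The coverage measure described above can be illustrated with a small sketch. It is not the paper's algorithm: it uses the standard greedy heuristic for maximum coverage, and it assumes the candidate sets are drawn from the collection itself, so that (for a downward-closed collection) no extra sets are covered. The function names `downward_closure` and `greedy_k_cover` and the toy data are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def downward_closure(maximal_sets):
    """Return all non-empty subsets of the given sets as frozensets.

    This builds a toy downward-closed collection, standing in for a
    collection of frequent item sets described by its maximal sets.
    """
    closed = set()
    for s in maximal_sets:
        items = sorted(s)
        for r in range(1, len(items) + 1):
            closed.update(frozenset(c) for c in combinations(items, r))
    return closed


def greedy_k_cover(collection, candidates, k):
    """Greedily pick up to k candidates covering as much of `collection` as possible.

    A member of the collection counts as covered when it is a subset of
    some chosen candidate. Ties are broken by candidate order.
    """
    uncovered = set(collection)
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0
        for cand in candidates:
            gain = sum(1 for s in uncovered if s <= cand)
            if gain > best_gain:
                best, best_gain = cand, gain
        if best is None:  # nothing left to gain
            break
        chosen.append(best)
        uncovered = {s for s in uncovered if not s <= best}
    return chosen, len(collection) - len(uncovered)


# Toy example (assumed data): three maximal sets, k = 2.
maximal = [frozenset("abc"), frozenset("cd"), frozenset("e")]
collection = downward_closure(maximal)
chosen, covered = greedy_k_cover(collection, maximal, k=2)
print(chosen, covered, len(collection))
```

On this toy input the collection has 10 sets, and the two chosen sets {a, b, c} and {c, d} cover 9 of them; only {e} is left uncovered. Allowing candidates outside the collection would require additionally tracking the extra sets they cover and checking them against the bound mentioned above.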
