Model-Independent Bounding of the Supports of Boolean Formulae in Binary Data

Data mining algorithms such as the Apriori method for finding frequent sets in sparse binary data can be used for efficient computation of a large number of summaries from huge data sets. The collection of frequent sets gives a collection of marginal frequencies about the underlying data set. Sometimes, we would like to use a collection of such marginal frequencies instead of the entire data set (e.g. when the original data is inaccessible for confidentiality reasons) to compute other interesting summaries. Using combinatorial arguments, we may obtain tight upper and lower bounds on the values of inferred summaries. In this paper, we consider a class of summaries wider than frequent sets, namely that of frequencies of arbitrary Boolean formulae. Given frequencies of a number of any different Boolean formulae, we consider the problem of finding tight bounds on the frequency of another arbitrary formula. We give a general formulation of the problem of bounding formula frequencies given some background information, and show how the bounds can be obtained by solving a linear programming problem. We illustrate the accuracy of the bounds by giving empirical results on real data sets.

[1]  Christophe Rigotti,et al.  A condensed representation to find frequent patterns , 2001, PODS '01.

[2]  Mohammed J. Zaki,et al.  Efficiently mining maximal frequent itemsets , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[3]  Heikki Mannila,et al.  Discovery of Frequent Episodes in Event Sequences , 1997, Data Mining and Knowledge Discovery.

[4]  Nicolas Pasquier,et al.  Efficient Mining of Association Rules Using Closed Itemset Lattices , 1999, Inf. Syst..

[5]  Jean-François Boulicaut,et al.  Approximation of Frequency Queris by Means of Free-Sets , 2000, PKDD.

[6]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[7]  Heikki Mannila,et al.  Fast Discovery of Association Rules , 1996, Advances in Knowledge Discovery and Data Mining.

[8]  Hannu Toivonen,et al.  Sampling Large Databases for Association Rules , 1996, VLDB.

[9]  Narendra Karmarkar,et al.  A new polynomial-time algorithm for linear programming , 1984, Comb..

[10]  Micha Sharir,et al.  A subexponential bound for linear programming , 1992, SCG '92.

[11]  Toon Calders,et al.  Deducing Bounds on the Frequency of Itemsets , 2002 .

[12]  Tian Zhang,et al.  BIRCH: A New Data Clustering Algorithm and Its Applications , 1997, Data Mining and Knowledge Discovery.

[13]  Heikki Mannila,et al.  Multiple Uses of Frequent Sets and Condensed Representations (Extended Abstract) , 1996, KDD.

[14]  Toon Calders,et al.  Mining All Non-derivable Frequent Itemsets , 2002, PKDD.

[15]  Heikki Mannila,et al.  Probabilistic Models for Query Approximation with Large Sparse Binary Data Sets , 2000, UAI.

[16]  Andrew W. Moore,et al.  Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets , 1998, J. Artif. Intell. Res..