Approximate Query Answering with Frequent Sets and Maximum Entropy

To instantiate this scheme, we combine two general and useful techniques: the summary information provided by frequent sets [1] and the probabilistic estimation principle of maximum entropy [2]. Our approximation method is as follows. Given the table r, we first compute the collection C of frequent sets of r for some suitable threshold. This collection is the summary of the data from which we compute the approximate answers. Given an arbitrary query φ over the table, let S be the set of attributes that occur in φ. We find all frequent sets in C that are included in S and construct the maximum entropy distribution on S using those frequent sets as constraints. We then evaluate φ on the maximum entropy distribution and return the result as the approximate answer. This approach applies to any type of query: because we construct the distribution on S, we can compute φ on that distribution regardless of the actual form of φ. After the initial computation of the frequent sets, the complexity of the method is independent of the size of the data, linear in the number of frequent sets contained in S, and exponential in |S|, the number of variables occurring in the query. Thus the method is useful for data of any size and tolerates a large number of constraints. The main limitation is the exponential dependence on the number of variables in the query, which in practice limits the method to queries involving no more than 10 variables.
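
The query-time step described above can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the text: attributes are binary, each frequent set X with frequency f contributes the constraint "the probability that all attributes in X equal 1 is f", and the maximum entropy distribution is fitted by iterative scaling, which is one standard solver and not necessarily the one the authors use. The names maxent_distribution and evaluate are hypothetical.

```python
from itertools import product


def maxent_distribution(attrs, constraints, sweeps=500):
    """Fit a maximum entropy distribution over binary attributes.

    attrs: list of attribute names (the set S of the query).
    constraints: dict mapping a frozenset of attribute names (a frequent
        set included in S) to its frequency in the table.
    Returns a dict mapping each 0/1 state tuple to its probability.
    Assumption: iterative scaling is used as the solver.
    """
    # The state space has 2^|S| entries, hence the exponential cost in |S|.
    states = list(product((0, 1), repeat=len(attrs)))
    index = {a: i for i, a in enumerate(attrs)}
    # Start from the uniform distribution, the maximum entropy
    # distribution when there are no constraints.
    p = {s: 1.0 / len(states) for s in states}
    # For each constraint, the states in which every attribute of the set is 1.
    covered = {
        itemset: {s for s in states if all(s[index[a]] for a in itemset)}
        for itemset in constraints
    }
    for _ in range(sweeps):
        # One sweep visits every constraint once: linear in their number.
        for itemset, freq in constraints.items():
            inside = covered[itemset]
            q = sum(p[s] for s in inside)
            if q <= 0.0 or q >= 1.0:
                continue  # rescaling is undefined; leave p unchanged
            for s in states:
                # Rescale so the constrained event has probability freq.
                p[s] *= freq / q if s in inside else (1.0 - freq) / (1.0 - q)
    return p


def evaluate(p, attrs, predicate):
    """Probability that predicate holds; predicate takes a dict attr -> 0/1."""
    return sum(prob for s, prob in p.items() if predicate(dict(zip(attrs, s))))


# Hypothetical frequent sets over S = {A, B}.
attrs = ["A", "B"]
frequent_sets = {
    frozenset({"A"}): 0.5,
    frozenset({"B"}): 0.4,
    frozenset({"A", "B"}): 0.3,
}
dist = maxent_distribution(attrs, frequent_sets)
estimate = evaluate(dist, attrs, lambda row: row["A"] or row["B"])
print(estimate)
```

In this toy case the three constraints determine the distribution completely, so by inclusion-exclusion the estimate for "A or B" approaches 0.5 + 0.4 - 0.3 = 0.6. Because the query is passed as an arbitrary predicate over a state, the same fitted distribution answers any query on S, which mirrors the claim that the method does not depend on the form of φ.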