Probabilistic Models for Query Approximation with Large Sparse Binary Data Sets

Large sparse sets of binary transaction data with millions of records and thousands of attributes occur in various domains: customers purchasing products, users visiting web pages, and documents containing words are just three typical examples. Real-time query selectivity estimation (the problem of estimating the number of rows in the data that satisfy a given predicate) is an important practical problem for such databases. We investigate the application of probabilistic models to this problem. In particular, we study a Markov random field (MRF) approach based on frequent sets and maximum entropy, and compare it to the independence model and the Chow-Liu tree model. We find that the MRF model provides substantially more accurate probability estimates than the other methods but is more expensive in both computation time and memory. To reduce the computational cost, we show how bucket elimination and clique tree approaches can exploit structure in the models and in the queries. We provide experimental results on two large real-world transaction datasets.
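To make the selectivity estimation task concrete, the following is a minimal sketch of the simplest baseline named above, the independence model, applied to a conjunctive query. It is an illustration under stated assumptions, not the authors' implementation: the function names, the representation of transactions as Python sets, and the toy data are all hypothetical. The estimate is the number of rows multiplied by the product of per-attribute marginal probabilities.

```python
from collections import Counter

def item_marginals(transactions):
    """Estimate P(item = 1) for every item as its relative frequency."""
    n = len(transactions)
    counts = Counter(item for row in transactions for item in row)
    return {item: c / n for item, c in counts.items()}

def independence_estimate(marginals, n_rows, positive=(), negative=()):
    """Estimated row count for a conjunctive query, assuming all
    attributes are mutually independent (hypothetical helper)."""
    prob = 1.0
    for item in positive:            # attributes required to be 1
        prob *= marginals.get(item, 0.0)
    for item in negative:            # attributes required to be 0
        prob *= 1.0 - marginals.get(item, 0.0)
    return n_rows * prob

def exact_count(transactions, positive=(), negative=()):
    """Ground truth obtained by scanning every row."""
    return sum(
        1
        for row in transactions
        if all(i in row for i in positive)
        and not any(i in row for i in negative)
    )

# Toy data (made up for illustration): each row is the set of items present.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
]

marginals = item_marginals(transactions)
estimate = independence_estimate(
    marginals, len(transactions), positive=("bread", "milk")
)
truth = exact_count(transactions, positive=("bread", "milk"))
```

On this toy data the independence estimate for "bread AND milk" differs from the exact count because the two items are not independent. Models that capture such dependencies, for example a Chow-Liu tree or an MRF, aim to reduce this kind of error, and the sketch does not attempt to reproduce them.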
