Fast factored density estimation and compression with Bayesian networks

Many important data analysis tasks can be formulated as probability estimation problems. For example, a popular general approach to automatic classification is to learn a probabilistic model of each class from data in which the classes are known, and then to use Bayes's rule with these models to predict the classes of other data for which they are not known. Anomaly detection and scientific discovery tasks can often be addressed by learning probability models over possible events and then looking for events to which these models assign low probabilities. Many data compression algorithms, such as Huffman coding and arithmetic coding, rely on probabilistic models of the data stream to achieve high compression rates.

In this thesis we examine several aspects of probability estimation algorithms. In particular, we focus on the automatic learning and use of probability models based on Bayesian networks, a convenient formalism in which the probability estimation task is split into many simpler subtasks. We also emphasize computational efficiency. First, we provide Bayesian network-based algorithms for losslessly compressing large discrete datasets. We show that these algorithms can produce compression ratios dramatically higher than those achieved by popular compression programs such as gzip or bzip2, while still maintaining megabyte-per-second decoding speeds on well-aged conventional PCs. Next, we provide algorithms for quickly learning Bayesian network-based probability models over domains with both discrete and continuous variables. We show how recently developed methods for quickly learning Gaussian mixture models from data [Moo99] can be used to learn Bayesian networks modeling complex nonlinear relationships over dozens of variables from thousands of datapoints in a practical amount of time. Finally, we explore a large space of tree-based density learning algorithms and show that they can be used to quickly learn Bayesian networks that provide accurate density estimates and are fast to evaluate.
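The link between factored probability models, anomaly detection, and compression can be illustrated with a minimal sketch. It is not the thesis's algorithm: the network structure is fixed by hand rather than learned, the add-one smoothing is an assumption, and every name (`fit_cpt`, `fit_network`, `code_length_bits`) is hypothetical. Each variable gets its own small conditional estimate given its parents, the joint probability is the product of those estimates, and the negative base-2 logarithm of that probability is the ideal code length in bits that an arithmetic coder driven by the model would approach.

```python
import math
from collections import Counter, defaultdict


def fit_cpt(rows, child, parents, arity, alpha=1.0):
    """Estimate P(child | parents) from counts, with add-alpha smoothing (an assumption)."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[tuple(row[p] for p in parents)][row[child]] += 1

    def prob(row):
        # Unseen parent configurations fall back to the uniform smoothed estimate.
        seen = counts.get(tuple(row[p] for p in parents), Counter())
        return (seen[row[child]] + alpha) / (sum(seen.values()) + alpha * arity[child])

    return prob


def fit_network(rows, structure, arity):
    """One conditional estimate per variable; structure maps child -> tuple of parents."""
    return [fit_cpt(rows, child, parents, arity) for child, parents in structure.items()]


def code_length_bits(row, cpts):
    """Ideal code length: -log2 of the product of the per-variable conditionals."""
    return -sum(math.log2(p(row)) for p in cpts)


# Toy dataset (hypothetical): variable 1 always copies variable 0, variable 2 is a coin flip.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 25
arity = {0: 2, 1: 2, 2: 2}

factored = fit_network(rows, {0: (), 1: (0,), 2: ()}, arity)    # edge 0 -> 1
independent = fit_network(rows, {0: (), 1: (), 2: ()}, arity)   # no edges

typical, unusual = (0, 0, 0), (0, 1, 0)
print("typical row, factored model:   ", code_length_bits(typical, factored))
print("typical row, independent model:", code_length_bits(typical, independent))
print("unusual row, factored model:   ", code_length_bits(unusual, factored))
```

In this toy setting the model with the edge assigns shorter codes to rows that follow the dependency, which is the sense in which a better density estimate gives better compression. A row that violates the dependency receives a low probability and hence a long code, which is the signal an anomaly detector would look for.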

[1]  Nir Friedman,et al.  Bayesian Network Classifiers , 1997, Machine Learning.

[2]  U. Fayyad,et al.  Scaling EM (Expectation Maximization) Clustering to Large Databases , 1998 .

[3]  Andrew W. Moore,et al.  Efficient Locally Weighted Polynomial Regression Predictions , 1997, ICML.

[4]  David Maxwell Chickering,et al.  Learning Bayesian Networks is NP-Complete , 1996, AISTATS.

[5]  J. Simonoff Multivariate Density Estimation , 1996 .

[6]  Geoffrey E. Hinton,et al.  Autoencoders, Minimum Description Length and Helmholtz Free Energy , 1993, NIPS.

[7]  Gregory F. Cooper,et al.  A latent variable model for multivariate discretization , 1999, AISTATS.

[8]  Nir Friedman,et al.  Gaussian Process Networks , 2000, UAI.

[9]  Nir Friedman,et al.  Sequential Update of Bayesian Network Structure , 1997, UAI.

[10]  Abraham Lempel,et al.  Compression of individual sequences via variable-rate coding , 1978, IEEE Trans. Inf. Theory.

[11]  Jorma Rissanen,et al.  Generalized Kraft Inequality and Arithmetic Coding , 1976, IBM J. Res. Dev..

[12]  Geoffrey E. Hinton,et al.  A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants , 1998, Learning in Graphical Models.

[13]  Nir Friedman,et al.  Discretizing Continuous Attributes While Learning Bayesian Networks , 1996, ICML.

[14]  Ian H. Witten,et al.  Arithmetic coding revisited , 1998, TOIS.

[15]  Abraham Lempel,et al.  A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[16]  Nir Friedman,et al.  Bayesian Network Classification with Continuous Attributes: Getting the Best of Both Discretization and Parametric Fitting , 1998, ICML.

[17]  Peter C. Cheeseman,et al.  Bayesian Classification (AutoClass): Theory and Results , 1996, Advances in Knowledge Discovery and Data Mining.

[18]  Kevin P. Murphy,et al.  Learning the Structure of Dynamic Probabilistic Networks , 1998, UAI.

[19]  J. Ross Quinlan,et al.  Induction of Decision Trees , 1986, Machine Learning.

[20]  Khalid Sayood,et al.  Introduction to Data Compression , 1996 .

[21]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper , 1977 .

[22]  Andrew W. Moore,et al.  Mix-nets: Factored Mixtures of Gaussians in Bayesian Networks with Mixed Continuous And Discrete Variables , 2000, UAI.

[23]  Brendan J. Frey,et al.  Does the Wake-sleep Algorithm Produce Good Density Estimators? , 1995, NIPS.

[24]  J. Morgan,et al.  THAID: A Sequential Analysis Program for the Analysis of Nominal Scale Dependent Variables , 1973 .

[25]  Steffen L. Lauritzen,et al.  Graphical Models , 1996 .

[26]  Shmuel Tomi Klein,et al.  Efficient variants of Huffman codes in high level languages , 1985, SIGIR '85.

[27]  Ron Kohavi,et al.  Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid , 1996, KDD.

[28]  Terry A. Welch,et al.  A Technique for High-Performance Data Compression , 1984, Computer.

[29]  P. McCullagh,et al.  Generalized Linear Models , 1984 .

[30]  Gregory F. Cooper,et al.  A Multivariate Discretization Method for Learning Bayesian Networks from Mixed Data , 1998, UAI.

[31]  Nir Friedman,et al.  Learning Bayesian Networks with Local Structure , 1996, UAI.

[32]  Peter E. Hart,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[33]  Brendan J. Frey,et al.  Graphical Models for Machine Learning and Digital Communication , 1998 .

[34]  Philip A. Chou,et al.  Optimal Partitioning for Classification and Regression Trees , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[35]  David A. Huffman,et al.  A method for the construction of minimum-redundancy codes , 1952, Proceedings of the IRE.

[36]  Darryl Morrell,et al.  Implementation of Continuous Bayesian Networks Using Sums of Weighted Gaussians , 1995, UAI.

[37]  David Maxwell Chickering,et al.  Learning Bayesian Networks: The Combination of Knowledge and Statistical Data , 1994, Machine Learning.

[38]  Andrew W. Moore,et al.  Bayesian networks for lossless dataset compression , 1999, KDD '99.

[39]  Keki B. Irani,et al.  Multi-interval discretization of continuous attributes as pre-processing for classification learning , 1993, IJCAI 1993.

[40]  Gregory F. Cooper,et al.  A Bayesian Method for the Induction of Probabilistic Networks from Data , 1992 .

[41]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[42]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[43]  Alistair Moffat,et al.  On the implementation of minimum redundancy prefix codes , 1997, IEEE Trans. Commun..

[44]  M. Rosenblatt Remarks on Some Nonparametric Estimates of a Density Function , 1956 .

[45]  C. N. Liu,et al.  Approximating discrete probability distributions with dependence trees , 1968, IEEE Trans. Inf. Theory.

[46]  Mehran Sahami,et al.  Learning Limited Dependence Bayesian Classifiers , 1996, KDD.

[47]  Geoffrey E. Hinton,et al.  The "wake-sleep" algorithm for unsupervised neural networks. , 1995, Science.

[48]  Ian H. Witten,et al.  Arithmetic coding for data compression , 1987, CACM.

[49]  Michael I. Jordan,et al.  An Introduction to Variational Methods for Graphical Models , 1999, Machine-mediated learning.

[50]  Rajeev Rastogi,et al.  SPARTAN: a model-based semantic compression system for massive data tables , 2001, SIGMOD '01.

[51]  Andrew W. Moore,et al.  Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets , 1998, J. Artif. Intell. Res..

[52]  Geoffrey E. Hinton,et al.  The Helmholtz Machine , 1995, Neural Computation.


[54]  Andrew W. Moore,et al.  The Anchors Hierarchy: Using the Triangle Inequality to Survive High Dimensional Data , 2000, UAI.

[55]  J. Friedman,et al.  Multivariate adaptive regression splines , 1991 .

[56]  Daphne Koller,et al.  Nonuniform Dynamic Discretization in Hybrid Networks , 1997, UAI.

[57]  Andrew W. Moore,et al.  Very Fast EM-Based Mixture Model Clustering Using Multiresolution Kd-Trees , 1998, NIPS.

[58]  James S. Albus,et al.  Brains, behavior, and robotics , 1981 .

[59]  A. Pentland,et al.  The Generalized CEM Algorithm , 1999, NIPS 1999.

[60]  R. Nigel Horspool,et al.  Data Compression Using Dynamic Markov Modelling , 1987, Comput. J..

[61]  Richard Clark Pasco,et al.  Source coding algorithms for fast data compression , 1976 .

[62]  Claude Berge,et al.  Graphs and Hypergraphs , 1973 .

[63]  David Heckerman,et al.  Models and Selection Criteria for Regression and Classification , 1997, UAI.

[64]  Michael I. Jordan,et al.  Efficient Stepwise Selection in Decomposable Models , 2001, UAI.

[65]  David Heckerman,et al.  Learning Bayesian Networks: A Unification for Discrete and Gaussian Domains , 1995, UAI.

[66]  Jacob Ziv,et al.  Coding theorems for individual sequences , 1978, IEEE Trans. Inf. Theory.

[67]  Pat Langley,et al.  Estimating Continuous Distributions in Bayesian Classifiers , 1995, UAI.

[68]  Alice M. Agogino,et al.  Inference Using Message Propagation and Topology Transformation in Vector Gaussian Continuous Networks , 1996, UAI.

[69]  Gregory F. Cooper,et al.  Learning Hybrid Bayesian Networks from Data , 1999, Learning in Graphical Models.

[70]  Aiko M. Hormann,et al.  Programs for Machine Learning. Part I , 1962, Inf. Control..

[71]  R. K. Shyamasundar,et al.  Introduction to algorithms , 1996 .

[72]  David Heckerman,et al.  Learning Gaussian Networks , 1994, UAI.

[73]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[74]  Daniel S. Hirschberg,et al.  Efficient decoding of prefix codes , 1990, CACM.

[75]  Thomas L. Dean,et al.  Probabilistic Temporal Reasoning , 1988, AAAI.

[76]  Ian H. Witten,et al.  Data Compression Using Adaptive Coding and Partial String Matching , 1984, IEEE Trans. Commun..