Learning with Mixtures of Trees

This paper describes the mixtures-of-trees model, a probabilistic model for discrete multidimensional domains. Mixtures of trees generalize the probabilistic trees of Chow and Liu (1968) in a direction different from, and complementary to, that of Bayesian networks. We present efficient algorithms for learning mixtures-of-trees models in both maximum-likelihood and Bayesian frameworks. We also discuss the additional efficiencies that can be obtained when data are "sparse," and we present data structures and algorithms that exploit such sparseness. Experimental results demonstrate the performance of the model in both density estimation and classification. Finally, we discuss the sense in which tree-based classifiers perform an implicit form of feature selection, and demonstrate a resulting insensitivity to irrelevant attributes.
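The building block of the model is the Chow and Liu (1968) procedure cited above: weight each candidate edge by the empirical mutual information between its two variables, then keep a maximum-weight spanning tree. A minimal sketch of that step, assuming discrete data given as tuples (function and variable names here are illustrative, not from the paper):

```python
# Chow-Liu tree fitting: pairwise mutual information + maximum-weight
# spanning tree (Kruskal's algorithm with union-find).
from collections import Counter
from itertools import combinations
from math import log

def mutual_information(data, i, j):
    """Empirical mutual information I(X_i; X_j) in nats."""
    n = len(data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    pij = Counter((row[i], row[j]) for row in data)
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n  # joint frequency; marginals are pi[a]/n, pj[b]/n
        mi += p_ab * log(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

def chow_liu_tree(data, n_vars):
    """Return the edges of a maximum-weight spanning tree under MI weights."""
    weights = sorted(
        ((mutual_information(data, i, j), i, j)
         for i, j in combinations(range(n_vars), 2)),
        reverse=True)
    parent = list(range(n_vars))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = []
    for _, i, j in weights:  # greedily add heaviest non-cycle edges
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            edges.append((i, j))
    return edges
```

In the mixture, the paper's EM algorithm reruns this tree-fitting step once per mixture component in each M step, with each data point weighted by its posterior probability of belonging to that component.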

[1] S. Kullback and R. A. Leibler. On Information and Sufficiency, 1951.

[2] P. Mazur. On the theory of Brownian motion, 1959.

[3] Robert G. Gallager, et al. Low-density parity-check codes, 1962, IRE Trans. Inf. Theory.

[4] Grace Jordison. Molecular Biology of the Gene, 1965, The Yale Journal of Biology and Medicine.

[5] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees, 1968, IEEE Trans. Inf. Theory.

[6] Robin Sibson, et al. SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method, 1973, Comput. J.

[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion), 1977.

[8] P. Kloeden, et al. Numerical Solution of Stochastic Differential Equations, 1992.

[9] G. J. Haltiner. Numerical Prediction and Dynamic Meteorology, 1980.

[10] Dorothy T. Thayer, et al. EM algorithms for ML factor analysis, 1982.

[11] Robert E. Tarjan, et al. Data Structures and Network Algorithms, 1983, CBMS-NSF Regional Conference Series in Applied Mathematics.

[12] Donald Geman, et al. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images, 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13] H. Risken. The Fokker-Planck Equation: Methods of Solution and Applications, 1985.

[14] Robert E. Tarjan, et al. Fibonacci heaps and their uses in improved network optimization algorithms, 1984, JACM.

[15] Robert E. Tarjan, et al. Efficient algorithms for finding minimum spanning trees in undirected and directed graphs, 1986, Comb.

[16] J. Loehlin. Latent Variable Models, 1987.

[17] Geoffrey J. McLachlan, et al. Mixture Models: Inference and Applications to Clustering, 1989.

[18] Matthew Self, et al. Bayesian Classification, 1988, AAAI.

[19] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1991, Morgan Kaufmann Series in Representation and Reasoning.

[20] A. P. Dawid, et al. Independence properties of directed Markov fields, 1990, Networks, 20, 491-505.

[21] Jude W. Shavlik, et al. Training Knowledge-Based Neural Networks to Recognize Genes, 1990, NIPS.

[22] Prakash P. Shenoy, et al. Probability propagation, 1990, Annals of Mathematics and Artificial Intelligence.

[23] J. N. R. Jeffers, et al. Graphical Models in Applied Multivariate Statistics, 1990.

[24] Steffen L. Lauritzen, et al. Bayesian updating in causal probabilistic networks by local computations, 1990.

[25] Gregory F. Cooper, et al. The Computational Complexity of Probabilistic Inference Using Bayesian Belief Networks, 1990, Artif. Intell.

[26] Steffen L. Lauritzen, et al. Independence properties of directed Markov fields, 1990, Networks.

[27] Jude W. Shavlik, et al. Interpretation of Artificial Neural Networks: Mapping Knowledge-Based Neural Networks into Rules, 1991, NIPS.

[28] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory, 2005.

[29] Dan Geiger, et al. An Entropy-based Learning Algorithm of Bayesian Conditional Trees, 1992, UAI.

[30] Radford M. Neal. Connectionist Learning of Belief Networks, 1992, Artif. Intell.

[31] Judea Pearl, et al. An Algorithm for Deciding if a Set of Observed Independencies Has a Causal Explanation, 1992, UAI.

[32] A. P. Dawid, et al. Applications of a general propagation algorithm for probabilistic expert systems, 1992.

[33] Uffe Kjaerulff. Approximation of Bayesian Networks through Edge Removals, 1993.

[34] Robert A. Jacobs, et al. Hierarchical Mixtures of Experts and the EM Algorithm, 1993, Neural Computation.

[35] Denise Draper, et al. Localized Partial Evaluation of Belief Networks, 1994, UAI.

[36] Frank Jensen, et al. Optimal Junction Trees, 1994, UAI.

[37] P. Courtier, et al. A strategy for operational implementation of 4D-Var, using an incremental approach, 1994.

[38] Hermann Ney, et al. On structuring probabilistic dependences in stochastic language modelling, 1994, Comput. Speech Lang.

[39] D. Madigan, et al. Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam's Window, 1994.

[40] Wray Buntine. A Guide to the Literature on Learning Graphical Models, 1994.

[41] Michael Ghil, et al. Advanced data assimilation in strongly nonlinear dynamical systems, 1994.

[42] David J. Spiegelhalter, et al. Machine Learning, Neural and Statistical Classification, 2009.

[43] S. Lauritzen. The EM algorithm for graphical association models with missing data, 1995.

[44] Walter R. Gilks, et al. BUGS - Bayesian inference Using Gibbs Sampling, Version 0.50, 1995.

[45] Peter Dayan, et al. Competition and Multiple Cause Models, 1995, Neural Comput.

[46] R. Zemel, et al. Learning sparse multiple cause models, 2000, Proceedings 15th International Conference on Pattern Recognition (ICPR-2000).

[47] Geoffrey E. Hinton, et al. The "wake-sleep" algorithm for unsupervised neural networks, 1995, Science.

[48] Brendan J. Frey, et al. Does the Wake-sleep Algorithm Produce Good Density Estimators?, 1995, NIPS.

[49] Wray L. Buntine. A Guide to the Literature on Learning Probabilistic Networks from Data, 1996, IEEE Trans. Knowl. Data Eng.

[50] David Heckerman, et al. Knowledge Representation and Inference in Similarity Networks and Bayesian Multinets, 1996, Artif. Intell.

[51] David J. C. MacKay and Radford M. Neal. Near Shannon limit performance of low density parity check codes, 1996.

[52] Alain Glavieux, et al. Reflections on the Prize Paper: "Near optimum error-correcting coding and decoding: turbo codes", 1998.

[53] Peter Green, et al. Markov Chain Monte Carlo in Practice, 1996.

[54] Finn Verner Jensen, et al. Introduction to Bayesian Networks, 2008, Innovations in Bayesian Networks.

[55] Peter C. Cheeseman, et al. Bayesian Classification (AutoClass): Theory and Results, 1996, Advances in Knowledge Discovery and Data Mining.

[56] Geoffrey E. Hinton, et al. The DELVE manual, 1996.

[57] Craig Boutilier, et al. Context-Specific Independence in Bayesian Networks, 1996, UAI.

[58] Dan Geiger, et al. A sufficiently fast algorithm for finding close to optimal junction trees, 1996, UAI.

[59] P. Spirtes, et al. A Polynomial Time Algorithm for Determining DAG Equivalence in the Presence of Latent Variables and Selection Bias, 1997, AISTATS.

[60] R. Tibshirani, et al. Discriminant Analysis by Gaussian Mixtures, 1996.

[61] Nir Friedman, et al. Building Classifiers Using Bayesian Networks, 1996, AAAI/IAAI, Vol. 2.

[62] Kathryn Fraughnaugh, et al. Introduction to Graph Theory, 1973, Mathematical Gazette.

[63] Michael I. Jordan, et al. Estimating Dependency Structure as a Hidden Variable, 1997, NIPS.

[64] Frederick Jelinek. Statistical Methods for Speech Recognition, 1997.

[65] Robert G. Cowell. Sampling without replacement in junction trees, 1997.

[66] Michael I. Jordan, et al. Variational methods for inference and estimation in graphical models, 1997.

[67] Luis M. de Campos, et al. Algorithms for Learning Decomposable Models and Chordal Graphs, 1997, UAI.

[68] Edward H. Adelson, et al. Belief Propagation and Revision in Networks with Loops, 1997.

[69] Hyeonjoon Moon, et al. The FERET evaluation methodology for face-recognition algorithms, 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[70] Geoffrey E. Hinton, et al. Automated motif discovery in protein structure prediction, 1997.

[71] Weiru Liu, et al. Learning belief networks from data: an information theory based approach, 1997, CIKM '97.

[72] Michael I. Jordan, et al. Probabilistic Independence Networks for Hidden Markov Probability Models, 1997, Neural Computation.

[73] Volker Tresp, et al. Nonlinear Markov Networks for Continuous Variables, 1997, NIPS.

[74] Michael I. Jordan, et al. A Mean Field Learning Algorithm for Unsupervised Neural Networks, 1999, Learning in Graphical Models.

[75] Jim Q. Smith, et al. On the Geometry of Bayesian Graphical Models with Hidden Variables, 1998, UAI.

[76] Nir Friedman. The Bayesian Structural EM Algorithm, 1998, UAI.

[77] Christopher M. Bishop. Latent Variable Models, 1998, Learning in Graphical Models.

[78] Jorma Rissanen. Stochastic Complexity in Statistical Inquiry, 1989, World Scientific Series in Computer Science.

[79] Geoffrey E. Hinton, et al. A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, 1998, Learning in Graphical Models.

[80] Dan Geiger, et al. Graphical Models and Exponential Families, 1998, UAI.

[81] Andrew W. Moore, et al. Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets, 1998, J. Artif. Intell. Res.

[82] Nir Friedman, et al. Bayesian Network Classification with Continuous Attributes: Getting the Best of Both Discretization and Parametric Fitting, 1998, ICML.

[83] Michael I. Jordan, et al. An Introduction to Variational Methods for Graphical Models, 1999, Machine Learning.

[84] Catherine Blake, et al. UCI Repository of Machine Learning Databases, 1998.

[85] Michael I. Jordan. Graphical Models, 2003.

[86] Gregory F. Cooper, et al. A Bayesian Network Classifier that Combines a Finite Mixture Model and a Naive Bayes Model, 1999, UAI.

[87] Lise Getoor, et al. Efficient learning using constrained sufficient statistics, 1999, AISTATS.

[88] Lise Getoor, et al. Learning Probabilistic Relational Models, 1999, IJCAI.

[89] Sanjoy Dasgupta. Learning Polytrees, 1999, UAI.

[90] Marina Meila-Predoviciu. Learning with Mixtures of Trees, 1999.

[91] David J. Spiegelhalter, et al. Probabilistic Networks and Expert Systems, 1999, Information Science and Statistics.

[92] G. Evensen, et al. An ensemble Kalman smoother for nonlinear dynamics, 2000.

[93] G. Eyink, et al. Most Probable Histories for Nonlinear Dynamics: Tracking Climate Transitions, 2000.

[94] Tommi S. Jaakkola. Tutorial on variational approximation methods, 2000.

[95] Lehel Csató, et al. Sparse On-Line Gaussian Processes, 2002, Neural Computation.

[96] Eugenia Kalnay. Atmospheric Modeling, Data Assimilation and Predictability, 2002.

[97] J. Whitaker, et al. Ensemble Data Assimilation without Perturbed Observations, 2002.