Latent tree models for multivariate density estimation: algorithms and applications

Multivariate density estimation is a fundamental problem in applied statistics and machine learning. Given a collection of data sampled from an unknown distribution, the task is to approximately reconstruct the generative distribution. There are two broad approaches to the problem: the parametric approach and the non-parametric approach. In the parametric approach, the approximate distribution is represented by a model from a predetermined family. In this thesis, we adopt the parametric approach and investigate the use of a model family called latent tree models for density estimation. Latent tree models are tree-structured Bayesian networks in which leaf nodes represent observed variables and internal nodes represent hidden variables. Such models can represent complex relationships among the observed variables and, at the same time, admit efficient inference. Consequently, they are a desirable tool for density estimation.

While latent tree models are studied in this thesis for the first time for the purpose of density estimation, they have previously been investigated for clustering and latent structure discovery. Several algorithms for learning latent tree models have been proposed. The state of the art is an algorithm called EAST. EAST determines model structures through principled and systematic search, and determines model parameters using the EM algorithm. It has been shown to achieve a good trade-off between fit to data and model complexity, and it is capable of discovering latent structures behind data. Unfortunately, its high computational complexity limits its applicability to density estimation problems.

In this thesis, we propose two algorithms for learning latent tree models specifically for density estimation. The two algorithms have distinct characteristics and are suitable for different applications. The first algorithm, HCL, assumes a predetermined bound on model complexity and restricts itself to binary tree structures. It first builds a binary tree structure based on mutual information and then runs the EM algorithm once on the resulting structure to determine the parameters. As such, it is efficient and can handle large applications. The second algorithm, Pyramid, assumes no predetermined bound on model complexity and is not restricted to binary tree structures. It builds model structures using heuristics based on mutual information and local search. Pyramid is slower than HCL, but it is faster than EAST and only slightly inferior to EAST in terms of the quality of the resulting models.
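Both algorithms use mutual information between observed variables to guide structure building. As a concrete illustration, the sketch below computes empirical pairwise mutual information from discrete data and greedily picks the most strongly related pair of variables; treating such a pair as candidates to share a latent parent is an assumption made here for illustration, not the exact procedure of HCL or Pyramid.

import numpy as np
from itertools import combinations

def empirical_mi(x, y):
    """Empirical mutual information (in nats) between two discrete data columns."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_ab = np.mean((x == a) & (y == b))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(x == a) * np.mean(y == b)))
    return mi

def most_informative_pair(X):
    """Return the pair of columns with the highest empirical mutual information."""
    scores = {(i, j): empirical_mi(X[:, i], X[:, j])
              for i, j in combinations(range(X.shape[1]), 2)}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.integers(0, 2, size=500)                      # hidden binary cause
    x0 = z ^ (rng.random(500) < 0.1).astype(int)          # observed noisy copy of z
    x1 = z ^ (rng.random(500) < 0.1).astype(int)          # another noisy copy of z
    x2 = rng.integers(0, 2, size=500)                     # unrelated observed variable
    X = np.column_stack([x0, x1, x2])
    print(most_informative_pair(X))                       # expected: (0, 1)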
We also study two applications of the density estimation techniques developed in the thesis. The first application is approximate probabilistic inference in Bayesian networks. A Bayesian network represents a joint distribution over a set of random variables. It often happens that the network structure is so complex that inference directly on the network is computationally intractable. We propose to approximate the joint distribution with a latent tree model and to exploit the latent tree model for faster inference. The idea is to sample data from the Bayesian network, learn a latent tree model from the data offline, and then, online, perform inference with the latent tree model instead of the original Bayesian network. HCL is used here because the sample size needs to be large to produce an accurate approximation, and because it allows a bound on the online running time to be fixed in advance. Empirical evidence shows that this method achieves good approximation accuracy at low online computational cost.

The second application is classification. A common approach to this task is to formulate it as a density estimation problem: one constructs the class-conditional density for each class and then applies Bayes' rule to classify new instances. We propose to estimate these class-conditional densities using either EAST or Pyramid. Empirical evidence shows that this method yields good classification performance. Moreover, the latent tree models built for the class-conditional densities are often meaningful, which is conducive to user confidence. A comparison between EAST and Pyramid reveals that Pyramid is significantly more efficient than EAST while achieving more or less the same classification performance.
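To make the classification scheme concrete, the following is a minimal sketch of the Bayes'-rule decision step. A simple product-of-Bernoullis density is used purely as a stand-in for the class-conditional densities; in the thesis those densities would instead be latent tree models learned with EAST or Pyramid.

import numpy as np

class BernoulliDensity:
    """Independent Bernoulli density over binary features (stand-in for a latent tree model)."""
    def fit(self, X):
        # Laplace-smoothed estimate of P(x_i = 1) for each feature.
        self.p = (X.sum(axis=0) + 1.0) / (X.shape[0] + 2.0)
        return self

    def log_prob(self, x):
        return float(np.sum(x * np.log(self.p) + (1 - x) * np.log(1 - self.p)))

def fit_class_conditional(X, y):
    """Fit one density per class and record the log class priors."""
    models, log_priors = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        models[c] = BernoulliDensity().fit(Xc)
        log_priors[c] = np.log(len(Xc) / len(y))
    return models, log_priors

def classify(x, models, log_priors):
    """Bayes' rule: pick the class maximising log p(c) + log p(x | c)."""
    return max(models, key=lambda c: log_priors[c] + models[c].log_prob(x))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 5))
    y = (X[:, 0] | X[:, 1]).astype(int)          # toy labels
    models, log_priors = fit_class_conditional(X, y)
    print(classify(np.array([1, 0, 1, 0, 0]), models, log_priors))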
