Unsupervised Learning

We give a tutorial and overview of the field of unsupervised learning from the perspective of statistical modeling. Unsupervised learning can be motivated from information-theoretic and Bayesian principles. We briefly review basic models in unsupervised learning, including factor analysis, PCA, mixtures of Gaussians, ICA, hidden Markov models, state-space models, and many variants and extensions. We derive the EM algorithm and give an overview of fundamental concepts in graphical models and inference algorithms on graphs. This is followed by a quick tour of approximate Bayesian inference, including Markov chain Monte Carlo (MCMC), the Laplace approximation, BIC, variational approximations, and expectation propagation (EP). The aim of this chapter is to provide a high-level view of the field. Along the way, many state-of-the-art ideas and future directions are also reviewed.
