Advanced probabilistic models for clustering and projection

Probabilistic modeling is a fundamental research area in data mining and machine learning. The general approach is to assume a generative model underlying the observed data and to estimate the model parameters by maximizing the likelihood. It rests on probability theory as its mathematical foundation and draws on a large body of methods from statistical learning, sampling theory, and Bayesian statistics. This thesis studies several advanced probabilistic models for data clustering and feature projection, two important unsupervised learning problems.

The goal of clustering is to group similar data points together and thereby uncover the cluster structure of the data. While numerous methods exist for various clustering tasks, one important question remains: how to determine the number of clusters automatically. The first part of the thesis answers this question from a mixture-modeling perspective. A finite mixture model is first introduced for clustering in which, for generality, each mixture component is assumed to be an exponential family distribution (sketched below). The model is then extended to an infinite mixture model, and its strong connection to the Dirichlet process (DP), a non-parametric Bayesian framework, is uncovered; the stick-breaking view of the DP is also sketched below. From this insight a variational Bayesian algorithm called VBDMA is derived that learns the number of clusters automatically, and empirical studies on several 2D data sets and an image data set verify its effectiveness.

In feature projection we are interested in dimensionality reduction, that is, in finding a low-dimensional feature representation of the data. We first review the well-known principal component analysis (PCA) and its probabilistic interpretation (PPCA), and then generalize PPCA to a novel probabilistic model for the non-linear projection known as kernel PCA. An expectation-maximization (EM) algorithm is derived for kernel PCA that is fast and applicable to large data sets (see the EM sketch below). We then propose a novel supervised projection method called MORP, which takes the output information into account in a supervised learning context; empirical studies on various data sets show much better results than unsupervised projection and other supervised projection methods. Finally, we generalize MORP probabilistically to obtain SPPCA for supervised projection, and naturally extend that model to S2PPCA, a semi-supervised projection method. This allows us to incorporate both the label information and the unlabeled data into the projection process.

In the third part of the thesis we introduce a unified probabilistic model that handles data clustering and feature projection jointly. The model can be viewed both as a clustering model with projected features and as a projection model with structured documents. A variational Bayesian learning algorithm is derived, and it turns out to iterate the clustering operations and the projection operations until convergence (a toy analogue of this alternation appears below). Superior performance is obtained for both clustering and projection.
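For concreteness, the finite mixture model with exponential family components referred to above can be written in the standard textbook form (general notation, not necessarily the thesis's own):

$$
p(x \mid \pi, \theta) \;=\; \sum_{k=1}^{K} \pi_k \, p(x \mid \theta_k),
\qquad
p(x \mid \theta_k) \;=\; h(x)\,\exp\!\big(\theta_k^\top T(x) - A(\theta_k)\big),
$$

where $\pi_k \ge 0$ with $\sum_k \pi_k = 1$ are the mixing weights, $T(x)$ is the sufficient statistic, and $A(\theta_k)$ is the log-partition function. Choosing the Gaussian, multinomial, or another exponential family member recovers the corresponding classical mixture model.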
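The connection between infinite mixtures and the DP is commonly made through Sethuraman's stick-breaking construction, which variational treatments truncate at a finite level. Below is a minimal numpy sketch of drawing truncated stick-breaking weights; the truncation level T and concentration alpha are illustrative choices, not the thesis's settings.

```python
import numpy as np

def stick_breaking_weights(alpha, T, seed=None):
    """Draw mixture weights from a stick-breaking prior truncated at T.

    v_k ~ Beta(1, alpha);  pi_k = v_k * prod_{j<k} (1 - v_j).
    Larger alpha spreads mass over more components, so more
    clusters tend to receive appreciable weight.
    """
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=T)
    v[-1] = 1.0                      # close the stick so weights sum to 1
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

pi = stick_breaking_weights(alpha=2.0, T=20, seed=0)
print(pi.round(3), pi.sum())         # weights sum to 1.0
```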
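The practical effect of such a variational algorithm, namely that redundant components lose essentially all posterior weight so the cluster count emerges from the data, can be previewed with scikit-learn's variational DP Gaussian mixture. This is only the closest off-the-shelf analogue of VBDMA, which handles general exponential family components, not this exact implementation.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Three well-separated 2D Gaussian blobs, but we over-provision 10 components.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in [(-3, 0), (0, 3), (3, 0)]])

vb = BayesianGaussianMixture(
    n_components=10,                                  # deliberate over-provisioning
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500, random_state=0,
).fit(X)

# Components whose posterior weight stays negligible are effectively pruned.
print(np.sort(vb.weights_).round(3))   # mass concentrates on ~3 components
```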
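The PPCA model that the projection part builds on is, in Tipping and Bishop's standard formulation:

$$
z \sim \mathcal{N}(0, I_q), \qquad
x \mid z \sim \mathcal{N}(Wz + \mu, \; \sigma^2 I_d),
$$

so that marginally $x \sim \mathcal{N}(\mu, WW^\top + \sigma^2 I_d)$. The maximum-likelihood $W$ spans the principal subspace of the sample covariance, and the limit $\sigma^2 \to 0$ recovers classical PCA.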
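The appeal of an EM algorithm here is that each iteration needs only matrix products, never an eigendecomposition of the full covariance or kernel matrix. As a hint of the mechanics, here is the zero-noise EM iteration for linear PCA due to Roweis; the kernel PCA algorithm in the thesis performs analogous steps in feature space, so treat this as a sketch of the idea rather than the thesis's algorithm.

```python
import numpy as np

def em_pca(X, q, n_iter=100):
    """Roweis' EM for PCA: X is d x n with centered columns, q latent dims.

    E-step: Z = (W^T W)^{-1} W^T X   (project data onto current subspace)
    M-step: W = X Z^T (Z Z^T)^{-1}   (refit subspace to the projections)
    Each iteration costs O(d*n*q), cheap whenever q << min(d, n).
    """
    d, n = X.shape
    W = np.random.default_rng(0).normal(size=(d, q))
    for _ in range(n_iter):
        Z = np.linalg.solve(W.T @ W, W.T @ X)       # E-step
        W = X @ Z.T @ np.linalg.inv(Z @ Z.T)        # M-step
    # Orthonormalize; the span of W converges to the principal subspace.
    W, _ = np.linalg.qr(W)
    return W

X = np.random.default_rng(1).normal(size=(50, 500))
X -= X.mean(axis=1, keepdims=True)                  # center the data
W = em_pca(X, q=3)
print(W.shape)                                       # (50, 3)
```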
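Finally, the unified model's learning loop, alternating clustering updates and projection updates, can be caricatured with a simple deterministic analogue that alternates subspace fitting and nearest-mean assignment. This toy loop is hypothetical and only mimics the structure of the variational algorithm; the thesis's model is fully probabilistic and designed for structured (e.g., document) data.

```python
import numpy as np

def alternate_cluster_project(X, q, K, n_iter=30, seed=0):
    """Toy alternation of a projection step and a clustering step.

    Given labels, fit a q-dimensional subspace to the cluster means;
    given the subspace, reassign each point to the nearest projected
    mean. This mimics, very loosely, how a variational algorithm for
    a joint clustering-projection model iterates the two operations.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))
    W = None
    for _ in range(n_iter):
        ids = [k for k in range(K) if np.any(labels == k)]
        means = np.array([X[labels == k].mean(axis=0) for k in ids])
        # projection step: principal directions of the cluster means
        _, _, Vt = np.linalg.svd(means - means.mean(axis=0),
                                 full_matrices=False)
        W = Vt[:q].T                                # (d, <=q) projection
        # clustering step: nearest cluster mean in the projected space
        Z, Zm = X @ W, means @ W
        dists = ((Z[:, None, :] - Zm[None, :, :]) ** 2).sum(axis=-1)
        labels = np.array(ids)[dists.argmin(axis=1)]
    return labels, W
```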
