Model order selection: Criteria, inference strategies and an application to biclustering

In this thesis we study unsupervised clustering methods that select the number of clusters on their own. Traditional methods based on information theory compare different models by penalizing more complex ones. More recently, a sophisticated method known as the Dirichlet process has been applied to clustering problems; one of its biggest advantages is its theoretically sound foundation: a single model covers any number of clusters. This, however, comes at a price: inference is arguably even harder than for "standard" clustering models, although in recent years researchers have proposed approximation algorithms that run efficiently at the cost of some accuracy. In this thesis we empirically compare these algorithms on synthetic data. We also compare the results with algorithms stemming from motivations other than the Dirichlet process, such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC); their standard forms are recalled in the sketch below. In the second part we study the application of the Dirichlet process to the problem of biclustering and propose two novel nonparametric algorithms, each assuming a different problem formulation. The two algorithms may also prove useful for feature selection and dimensionality reduction.

Acknowledgments

First and above all, I want to express my gratitude to Peter Orbanz; during the course of this master's thesis he always took the time to answer my questions and gave me valuable input on how to improve certain experiments or express facts more concisely. He has the great gift of explaining things in an easy-to-understand way without sacrificing correctness. I had already witnessed this in the machine learning courses I attended as part of my studies, where Peter was a teaching assistant.

I would also like to thank Prof. Buhmann for being my mentor and fostering my interest in machine learning during my master's studies. He was very supportive in finding a topic that suits my interests and knowledge, and also left a certain degree of freedom to see where the journey would take us. His comments in various meetings were also very helpful in refining the biclustering models.

I was fortunate enough to work for half a year as an intern under the supervision of Matthew Brand at the Mitsubishi Electric Research Labs (MERL) in Cambridge, MA. Matt, possessing an immense knowledge of areas as diverse as machine learning, graphics, computer vision, and theoretical computer science, could usually answer questions I …
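For reference, a minimal sketch of the penalized model-selection criteria mentioned in the abstract (due to Akaike and Schwarz, respectively), in their standard form; the symbols k (number of free model parameters), n (number of observations), and \hat{L} (maximized likelihood) are introduced here only for this sketch:

    \mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}

By contrast, the Dirichlet process avoids fixing the number of clusters in advance. In its Chinese restaurant process representation, with concentration parameter \alpha and n_c the current size of cluster c, observation i is assigned by

    P(z_i = c \mid z_1, \ldots, z_{i-1}) = \frac{n_c}{i - 1 + \alpha}, \qquad P(z_i = \text{new cluster}) = \frac{\alpha}{i - 1 + \alpha},

so the number of clusters grows with the data rather than being selected beforehand.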
