Cross-Validation for Model Selection in Model-Based Clustering

Rachel L.H. O’Reilly, University of Guelph, 2012. Advisor: Paul D. McNicholas.

Clustering is a technique used to partition unlabelled data into meaningful groups. This thesis focuses on model-based clustering, where the data are assumed to arise from a finite number of subpopulations, each of which follows a known statistical distribution. The number of groups and the shape of each group are unknown in advance, so selecting them is one of the most challenging aspects of clustering. Cross-validation is a model selection technique that is often used in regression and classification because it tends to choose models that predict well and are not over-fit to the data. However, it has rarely been applied in a clustering framework. Herein, cross-validation is applied to select the number of groups and the covariance structure within a family of Gaussian mixture models. Results are presented for both real and simulated data.
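The selection procedure described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation, and it rests on several assumptions: the data are univariate, so "covariance structure" reduces to a choice between a common variance and group-specific variances; the cross-validation score is the held-out log-likelihood, which the abstract does not specify; and every name in it (`fit_mixture`, `cv_log_likelihood`, `select_model`, `VAR_FLOOR`) is hypothetical.

```python
import math
import random

VAR_FLOOR = 1e-6  # assumed guard against a component collapsing onto one point


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(data, n_groups, equal_var, n_iter=50):
    """Fit a univariate Gaussian mixture by EM; returns (weights, means, variances)."""
    xs = sorted(data)
    n = len(xs)
    centre = sum(xs) / n
    spread = max(sum((x - centre) ** 2 for x in xs) / n, VAR_FLOOR)
    # Deterministic start: means at evenly spaced quantiles, overall variance.
    means = [xs[int((g + 0.5) * n / n_groups)] for g in range(n_groups)]
    variances = [spread] * n_groups
    weights = [1.0 / n_groups] * n_groups
    for _ in range(n_iter):
        # E-step: posterior probability of each group for each point.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update mixing proportions, means, and variances.
        sizes = [max(sum(r[g] for r in resp), 1e-10) for g in range(n_groups)]
        weights = [s / n for s in sizes]
        means = [sum(r[g] * x for r, x in zip(resp, xs)) / sizes[g]
                 for g in range(n_groups)]
        scatter = [sum(r[g] * (x - means[g]) ** 2 for r, x in zip(resp, xs))
                   for g in range(n_groups)]
        if equal_var:
            variances = [max(sum(scatter) / n, VAR_FLOOR)] * n_groups
        else:
            variances = [max(scatter[g] / sizes[g], VAR_FLOOR)
                         for g in range(n_groups)]
    return weights, means, variances


def log_likelihood(data, model):
    weights, means, variances = model
    total = 0.0
    for x in data:
        dens = sum(w * normal_pdf(x, m, v)
                   for w, m, v in zip(weights, means, variances))
        total += math.log(max(dens, 1e-300))
    return total


def cv_log_likelihood(data, n_groups, equal_var, n_folds=5):
    """Sum of held-out log-likelihoods over the folds (data assumed shuffled)."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    score = 0.0
    for i, held_out in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        score += log_likelihood(held_out, fit_mixture(train, n_groups, equal_var))
    return score


def select_model(data, max_groups=4, n_folds=5):
    """Return the (n_groups, equal_var) pair with the highest CV score."""
    candidates = [(g, eq) for g in range(1, max_groups + 1) for eq in (True, False)]
    scores = {c: cv_log_likelihood(data, c[0], c[1], n_folds) for c in candidates}
    return max(scores, key=scores.get), scores


# Simulated example: two well-separated groups with unit variance.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(100)]
data += [rng.gauss(6.0, 1.0) for _ in range(100)]
rng.shuffle(data)
best, scores = select_model(data)
print("selected (groups, equal variance):", best)
```

Each candidate is fitted on the training folds only and scored on the fold it never saw, so extra groups or extra variance parameters are rewarded only if they improve prediction. Moving to a multivariate family would change the fitting and density functions; the fold loop would stay the same.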
