Exploring the number of groups in robust model-based clustering

Two key questions in clustering problems are how to determine the number of groups properly and how to measure the strength of group assignments. These questions are especially involved when a certain fraction of outlying data is also expected. Any answer to these two questions should depend on the assumed probabilistic model, the allowed group scatters, and what we understand by noise. With this in mind, some exploratory “trimming-based” tools are presented in this work together with their justifications. These tools are based on monitoring the optimal values reached when solving a robust clustering criterion and on the use of some “discriminant” factors.
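As a rough illustration of what monitoring the optimal values of a robust clustering criterion can look like, the sketch below tabulates the best objective found for several numbers of groups k and trimming levels alpha. It is only a sketch under stated assumptions: it swaps in the trimmed k-means criterion (sum of squared distances over the untrimmed points) as a simple stand-in for the robust criterion the abstract refers to, and the function names, the simulated data, and the grids of k and alpha are all hypothetical choices made for this example.

```python
import numpy as np

def _assign(X, centers, n_keep):
    """Assign points to nearest centers and trim the farthest ones (label -1)."""
    n = X.shape[0]
    # squared distance from every point to every center, shape (n, k)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    dmin = d2[np.arange(n), nearest]
    keep = np.argsort(dmin)[:n_keep]          # the n_keep closest points survive
    labels = np.full(n, -1)
    labels[keep] = nearest[keep]
    return labels, dmin[keep].sum()

def trimmed_kmeans(X, k, alpha, n_starts=10, n_iter=50, seed=0):
    """Illustrative trimmed k-means: returns (best objective, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_keep = n - int(np.floor(alpha * n))     # number of untrimmed points
    best_obj, best_labels = np.inf, None
    for _ in range(n_starts):
        centers = X[rng.choice(n, size=k, replace=False)]
        for _ in range(n_iter):
            labels, _ = _assign(X, centers, n_keep)
            new_centers = centers.copy()
            for j in range(k):
                members = X[labels == j]
                if len(members) > 0:          # keep old center if group is empty
                    new_centers[j] = members.mean(axis=0)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        labels, obj = _assign(X, centers, n_keep)
        if obj < best_obj:
            best_obj, best_labels = obj, labels
    return best_obj, best_labels

# Simulated data (an assumption for this example): two groups plus scattered noise.
data_rng = np.random.default_rng(1)
X = np.vstack([
    data_rng.normal(0.0, 1.0, size=(50, 2)),
    data_rng.normal(10.0, 1.0, size=(50, 2)),
    data_rng.uniform(-30.0, 40.0, size=(10, 2)),
])

# Monitor the best objective value over a grid of k and alpha.
curves = {}
for k in (1, 2, 3, 4):
    for alpha in (0.0, 0.05, 0.10, 0.15):
        obj, _ = trimmed_kmeans(X, k, alpha)
        curves[(k, alpha)] = obj

for (k, alpha), obj in sorted(curves.items()):
    print(f"k={k} alpha={alpha:.2f} objective={obj:.1f}")
```

Read as curves in alpha for each k, these values give one exploratory view: a large drop in the objective when moving from k to k+1 suggests that the extra group is capturing real structure, while a negligible drop at a given trimming level suggests that it is not needed once that fraction of the data is discarded.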
