Factor probabilistic distance clustering (FPDC): a new clustering method

Factor clustering methods have been developed in recent years thanks to improvements in computational power. These methods perform a linear transformation of the data and a clustering of the transformed data, optimizing a common criterion. Probabilistic distance (PD)-clustering is an iterative, distribution-free, probabilistic clustering method. Factor PD-clustering (FPDC) is based on PD-clustering and involves a linear transformation of the original variables into a reduced number of orthogonal ones, chosen under the same criterion that PD-clustering optimizes. This paper demonstrates that the Tucker3 decomposition can be used to accomplish this transformation. Factor PD-clustering alternates between a Tucker3 decomposition and PD-clustering of the transformed data until convergence is achieved. This method can significantly improve the performance of the PD-clustering algorithm; large data sets can thus be partitioned into clusters with greater stability and robustness of the results. Real and simulated data sets are used to compare FPDC with its main competitors: it performs equally well when clusters are elliptically shaped and outperforms its competitors when clusters are non-Gaussian in shape or the data are noisy.
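As an illustration of the PD-clustering step only, the following is a minimal sketch, not the authors' implementation. It assumes the commonly cited formulation of PD-clustering, in which the membership probability of a point in a cluster is inversely proportional to its Euclidean distance from that cluster's center, and each center is updated as a weighted mean with weights p^2/d. The function name pd_clustering and its parameters (init, n_iter, tol) are hypothetical, and the Tucker3 step of FPDC is not shown.

```python
import numpy as np


def pd_clustering(X, init, n_iter=200, tol=1e-8):
    """Sketch of PD-clustering (assumed formulation, Euclidean distances).

    X    : (n, p) data matrix
    init : (k, p) initial cluster centers (hypothetical parameter)
    Returns the final centers (k, p) and membership probabilities (n, k).
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init, dtype=float).copy()
    eps = 1e-12  # guards against division by zero when a point sits on a center

    def memberships(c):
        # Distance of every point to every center, shape (n, k).
        d = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2) + eps
        # Probabilities inversely proportional to distance, so that
        # p * d is the same for every cluster of a given point.
        inv = 1.0 / d
        return d, inv / inv.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        d, p = memberships(centers)
        # Center update: weighted mean of the points with weights p^2 / d.
        w = p ** 2 / d
        new_centers = (w.T @ X) / w.sum(axis=0)[:, None]
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break

    _, p = memberships(centers)
    return centers, p
```

In this sketch a hard partition can be read off with p.argmax(axis=1). FPDC, as described above, would alternate a step of this kind with a Tucker3-based projection of the data onto a reduced set of orthogonal factors.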
