Model-based clustering using copulas with applications

Most model-based clustering techniques are based on multivariate normal models and their variants. In this paper, copulas are used to construct flexible families of models for clustering applications. The use of copulas in model-based clustering offers two direct advantages over current methods: (i) an appropriate choice of copulas makes it possible to obtain a range of exotic shapes for the clusters, and (ii) the explicit choice of marginal distributions for the clusters allows multivariate data of various modes (discrete, continuous, or both) to be modelled in a natural way. This paper introduces and studies the framework of copula-based finite mixture models for clustering. Estimation in the general case can be performed using standard EM, and, depending on the mode of the data, more efficient procedures are provided that fully exploit the copula structure. The closure properties of the mixture models under marginalization are discussed. For continuous, real-valued data, parametric rotations in the sample space are introduced, with a parallel discussion of parameter identifiability depending on the choice of copulas for the components. The exposition of the methodology is accompanied and motivated by the analysis of real and artificial data.
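To make the idea of a copula-based mixture component concrete, the following is a minimal sketch, not the paper's implementation. It assumes a bivariate Clayton copula with exponential margins for every component; both choices, and all function and parameter names, are illustrative assumptions. Each component density is the copula density evaluated at the marginal CDFs, multiplied by the marginal densities, and the E-step of a standard EM iteration computes the posterior membership probabilities of an observation.

```python
import math

def exp_cdf(x, rate):
    """CDF of an exponential margin (assumed margin family), x > 0."""
    return 1.0 - math.exp(-rate * x)

def exp_pdf(x, rate):
    """Density of an exponential margin, x > 0."""
    return rate * math.exp(-rate * x)

def clayton_density(u, v, theta):
    """Bivariate Clayton copula density for theta > 0 and 0 < u, v < 1."""
    s = u ** (-theta) + v ** (-theta) - 1.0
    return ((1.0 + theta)
            * (u * v) ** (-(1.0 + theta))
            * s ** (-(2.0 + 1.0 / theta)))

def component_density(x, comp):
    """Density of one component: c(F1(x1), F2(x2)) * f1(x1) * f2(x2)."""
    x1, x2 = x
    r1, r2 = comp["rates"]
    u, v = exp_cdf(x1, r1), exp_cdf(x2, r2)
    return clayton_density(u, v, comp["theta"]) * exp_pdf(x1, r1) * exp_pdf(x2, r2)

def mixture_density(x, weights, comps):
    """Finite mixture density: weighted sum of the component densities."""
    return sum(w * component_density(x, c) for w, c in zip(weights, comps))

def responsibilities(x, weights, comps):
    """E-step: posterior probability that x belongs to each component."""
    joint = [w * component_density(x, c) for w, c in zip(weights, comps)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-component model, chosen only for illustration.
weights = [0.5, 0.5]
comps = [
    {"rates": (1.0, 1.0), "theta": 2.0},  # small values, stronger dependence
    {"rates": (0.2, 0.2), "theta": 0.5},  # larger values, weaker dependence
]

resp_small = responsibilities((0.5, 0.5), weights, comps)
resp_large = responsibilities((8.0, 8.0), weights, comps)
```

Swapping `clayton_density` for another copula density, or the exponential functions for other margins (including discrete ones, with the appropriate likelihood), changes the cluster shape without altering the mixture or E-step logic; the M-step is omitted here because it depends on the chosen families.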
