Asymmetric clusters and outliers: Mixtures of multivariate contaminated shifted asymmetric Laplace distributions

Mixtures of multivariate contaminated shifted asymmetric Laplace distributions are developed for handling asymmetric clusters in the presence of outliers (also referred to as bad points herein). In addition to the parameters of the related non-contaminated mixture, for each (asymmetric) cluster, our model has one parameter controlling the proportion of outliers and one specifying the degree of contamination. Crucially, these parameters do not have to be specified a priori, adding a flexibility to our approach that is absent from other approaches such as trimming. Moreover, each observation is given a posterior probability of belonging to a particular cluster, and of being an outlier or not; advantageously, this allows for the automatic detection of outliers. An expectation-conditional maximization algorithm is outlined for parameter estimation and various implementation issues are discussed. The behaviour of the proposed model is investigated, and compared with well-established finite mixtures, on artificial and real data.
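To make the role of the two contamination parameters concrete, the following is a minimal sketch of how a posterior probability of being an outlier can be computed for a single cluster. It is not the model developed here: a univariate normal density stands in for the shifted asymmetric Laplace density purely for illustration, and the names `prop_bad` (proportion of outliers), `eta` (degree of contamination, assumed to inflate the variance) and `outlier_posterior` are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Univariate normal density (a stand-in for the cluster density)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def outlier_posterior(x, mean, var, prop_bad, eta):
    """Posterior probability that x is a bad point.

    The contaminated density is assumed to be a two-part mixture:
    (1 - prop_bad) * f(x; mean, var) + prop_bad * f(x; mean, eta * var),
    with eta > 1 inflating the spread of the bad-point component.
    """
    good = (1.0 - prop_bad) * normal_pdf(x, mean, var)
    bad = prop_bad * normal_pdf(x, mean, eta * var)
    return bad / (good + bad)


# Illustrative values only; in practice both parameters are estimated.
p_typical = outlier_posterior(0.0, mean=0.0, var=1.0, prop_bad=0.05, eta=10.0)
p_extreme = outlier_posterior(6.0, mean=0.0, var=1.0, prop_bad=0.05, eta=10.0)

# An observation is flagged as an outlier when this posterior exceeds 1/2.
flagged = p_extreme > 0.5
```

Under these assumed values, a point at the cluster centre receives a small outlier probability while a point six standard deviations away is flagged, which is the sense in which detection is automatic once the parameters have been estimated.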
