Assessing trimming methodologies for clustering linear regression data

We assess the performance of state-of-the-art robust clustering tools for regression structures under a variety of data configurations. We focus on two methodologies whose main ingredients are trimming and restrictions on group scatters. We also give particular care to the data generation process by developing a flexible simulation tool for mixtures of regressions, in which the user can control the degree of overlap between the groups. The trimming level and the restriction factors are input parameters that require appropriate tuning. Since we find that an incorrect specification of the second-level trimming in the Trimmed CLUSTering REGression model (TCLUST-REG) can degrade the performance of the method, we propose an improvement in which the second-level trimming is not fixed in advance but is data dependent. We then compare our adaptive version of TCLUST-REG with the Trimmed Cluster Weighted Restricted Model (TCWRM), which provides a powerful extension of the robust clusterwise regression methodology. Our overall conclusion is that the two methods perform comparably, but with notable differences due to the inherent degree of modeling implied by them.
