Clusters, outliers, and regression: fixed point clusters

Fixed point clustering is a new stochastic approach to cluster analysis. The definition of a single fixed point cluster (FPC) is based on a simple parametric model, but, in contrast to mixture modeling and other approaches, no parametric assumption is made for the whole dataset. An FPC is defined as a data subset that is exactly the set of non-outliers with respect to its own parameter estimators. This paper concentrates on the theoretical foundation of FPC analysis as a method for clusterwise linear regression, in which each cluster is modeled as a linear regression with normal errors. In this setup, fixed point clustering is based on an iteratively reweighted estimation with zero weight for all outliers. FPCs are non-hierarchical, but they may overlap and include each other. The number of clusters need not be specified. Consistency results are given for certain mixture models of interest in cluster analysis, and convergence of a fixed point algorithm is shown. Application to a real dataset shows that, in the presence of deviations from the usual assumptions of model-based cluster analysis, fixed point clustering can highlight interesting features of a dataset that differ from those found by maximum likelihood methods.
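The fixed point idea can be illustrated with a minimal sketch. It assumes least squares estimation on the current subset and an outlier rule of the form "squared residual greater than c times the estimated error variance". The function name `fixed_point_cluster`, the tuning constant `c`, and the degrees-of-freedom correction are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def fixed_point_cluster(x, y, start, c=10.0, max_iter=100):
    """Iterate a 0/1 weight vector until it reproduces itself.

    x, y  : 1-D arrays of regressor and response values
    start : boolean array marking the initial subset
    c     : assumed tuning constant of the outlier rule (hypothetical default)
    Returns (members, beta, sigma2) at a fixed point, or None otherwise.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])  # intercept + slope
    p = design.shape[1]
    members = np.asarray(start, dtype=bool).copy()

    for _ in range(max_iter):
        if members.sum() <= p:
            return None  # too few points to estimate the error variance
        # Least squares fit using only the current members (weight 1);
        # all other points have weight 0.
        beta, *_ = np.linalg.lstsq(design[members], y[members], rcond=None)
        resid = y - design @ beta
        sigma2 = np.sum(resid[members] ** 2) / (members.sum() - p)
        # Non-outliers with respect to the current fit.
        new_members = resid ** 2 <= c * sigma2
        if np.array_equal(new_members, members):
            return members, beta, sigma2  # subset equals its own non-outliers
        members = new_members
    return None  # no fixed point reached within max_iter
```

Running the iteration from many different starting subsets can yield several distinct fixed points, which may overlap or include each other; no number of clusters is passed in.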
