Outlier Detection and Data Cleaning in Multivariate Non-Normal Samples: The PAELLA Algorithm

A new method of outlier detection and data cleaning for both normal and non-normal multivariate data sets is proposed. It is based on an iterated local fit without a priori metric assumptions. The approach is supported by finite mixture clustering, which provides good results with large data sets. A multi-step structure consisting of three phases is developed. The importance of outlier detection in industrial modeling for open-loop control prediction is also described. The algorithm gives good results both in simulation runs with artificial data sets and with experimental data sets recorded in a rubber factory. Finally, the methodology is discussed.
