Outlier Detection and Data Cleaning in Multivariate Non-Normal Samples: The PAELLA Algorithm

A new method of outlier detection and data cleaning for both normal and non-normal multivariate data sets is proposed. It is based on an iterated local fit without a priori metric assumptions. The approach is supported by finite mixture clustering, which provides good results with large data sets. A multi-step structure consisting of three phases is developed. The importance of outlier detection in industrial modeling for open-loop control prediction is also described. The algorithm gives good results both in simulation runs with artificial data sets and with experimental data sets recorded in a rubber factory. Finally, the methodology is discussed.
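The abstract does not spell out the three phases, so the following is only a generic illustration of what an iterated fit for outlier flagging can look like, not the PAELLA algorithm itself. It assumes a single Gaussian cluster in two dimensions and a chi-square cutoff on the squared Mahalanobis distance, whereas the paper works with finite mixture clustering and no a priori metric. All function names and the `alpha` parameter are hypothetical.

```python
import math

def fit_gaussian_2d(points):
    """Sample mean and covariance of 2-D points; covariance is (a, b, c) for [[a, b], [b, c]]."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (a, b, c)

def mahalanobis_sq(p, mean, cov):
    """Squared Mahalanobis distance, using the closed-form inverse of a 2x2 matrix."""
    a, b, c = cov
    det = a * c - b * b
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    return (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det

def iterated_outlier_flags(points, alpha=0.025, max_iter=20):
    """Refit on retained points and re-flag all points until the flags stop changing.

    Assumption: one Gaussian cluster. For 2 degrees of freedom the chi-square
    quantile at level 1 - alpha has the closed form -2 * ln(alpha).
    """
    threshold = -2.0 * math.log(alpha)
    keep = [True] * len(points)
    for _ in range(max_iter):
        retained = [p for p, k in zip(points, keep) if k]
        mean, cov = fit_gaussian_2d(retained)
        new_keep = [mahalanobis_sq(p, mean, cov) <= threshold for p in points]
        if new_keep == keep:
            break
        keep = new_keep
    return [not k for k in keep]

# Toy data: a 5x5 grid of inliers plus one distant point.
points = [(x, y) for x in range(5) for y in range(5)] + [(50, 50)]
flags = iterated_outlier_flags(points)
```

Refitting after each pass matters because a gross outlier inflates the first covariance estimate; once it is excluded, the second fit reflects the bulk of the data and the flags stabilize.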

Manuel Castejón Limas | Joaquín B. Ordieres Meré | Francisco J. Martínez de Pisón Ascacíbar | Eliseo P. Vergara González

[1] N. Campbell. Robust Procedures in Multivariate Analysis I: Robust Covariance Estimation, 1980.

[2] Richard D. De Veaux, et al. Robust estimation of a normal mixture, 1990.

[3] A. Cuevas, et al. Estimating the number of clusters, 2000.

[4] A. Hadi, et al. BACON: blocked adaptive computationally efficient outlier nominators, 2000.

[5] Geoffrey J. McLachlan. On the choice of starting values for the EM algorithm in fitting mixture models, 1988.

[6] M. Markatou. Mixture Models, Robustness, and the Weighted Likelihood Methodology, 2000, Biometrics.

[7] Adrian E. Raftery, et al. Principal Curve Clustering With Noise, 1997.

[8] Francisco Javier Martínez de Pisón Ascacíbar, et al. Control de calidad: metodología para el análisis previo a la modelización de datos en procesos industriales, fundamentos teóricos y aplicaciones prácticas con R, 2001.

[9] C. McGreavy, et al. Data Mining and Knowledge Discovery for Process Monitoring and Control, 1999.

[10] Adrian E. Raftery, et al. MCLUST: Software for Model-Based Cluster Analysis, 1999.

[11] Bo Thiesson, et al. Accelerating EM for Large Databases, 2001, Machine Learning.

[12] G. Sawitzki, et al. Using excess mass estimates to investigate the modality of a distribution, 1991.

[13] G. J. McLachlan, et al. On Computational Aspects of Clustering via Mixtures of Normal and t-Components, 1981.

[14] Adrian E. Raftery, et al. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis, 1998, Comput. J.

[15] Ross Ihaka, Robert Gentleman. R: A Language for Data Analysis and Graphics, 1996.

[16] David L. Woodruff, et al. Identification of Outliers in Multivariate Data, 1996.

[17] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion), 1977.

[18] David M. Rocke, et al. Some computational issues in cluster analysis with no a priori metric, 1999.

[19] C. Ribeiro, et al. Clustering and clique partitioning: Simulated annealing and tabu search approaches, 1992.

[20] J. Friedman, et al. Projection Pursuit Regression, 1981.

[21] A. Hardy. On the number of clusters, 1996.

[22] G. McLachlan, et al. The EM algorithm and extensions, 1996.

[23] M. Srivastava, et al. Outliers in Multivariate Regression Models, 1998.

[24] U. Fayyad, et al. Scaling EM (Expectation Maximization) Clustering to Large Databases, 1998.

[25] Teresa Gallegos, et al. A Robust Method for Clustering Analysis, 2000.

[26] Peter J. Rousseeuw, et al. Robust regression and outlier detection, 1987.

[27] Geoffrey J. McLachlan, et al. Robust Cluster Analysis via Mixtures of Multivariate t-Distributions, 1998, SSPR/SPR.

[28] Jeff A. Bilmes, et al. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, 1998.

[29] A. Cuevas, et al. Cluster analysis: a further approach based on density estimation, 2001.

[30] John A. Hartigan, et al. Clustering Algorithms, 1975.

[31] A. Raftery, et al. Model-based Gaussian and non-Gaussian clustering, 1993.

[32] David L. Woodruff, et al. Robust estimation of multivariate location and shape, 1997.

[33] Stanley P. Azen, et al. Computational Statistics and Data Analysis (CSDA), 2006.