A Genetic Algorithm Based Modification on the LTS Algorithm for Large Data Sets

The authors introduce an algorithm for estimating the least trimmed squares (LTS) regression parameters in large data sets. The algorithm uses a genetic algorithm search to form a basic subset that is unlikely to contain outliers. Rousseeuw and van Driessen (2006) suggested drawing independent basic subsets and iterating C-steps many times to minimize the LTS criterion. The authors' algorithm instead uses a genetic algorithm to form the basic subset and iterates C-steps to calculate the cost value of the LTS criterion. Genetic algorithms are successful methods for optimizing nonlinear objective functions, but they are slower in many cases. The genetic algorithm configuration can be kept simple here because only a small number of observations is searched for in the data. An R package was prepared to perform Monte Carlo simulations of the algorithm. Simulation results show that the algorithm is suitable even for large data sets because only a small number of trials is ever performed.
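The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' R package and does not reproduce their genetic algorithm configuration. It assumes simple regression (one predictor plus an intercept) so that ordinary least squares has a closed form, and every name in it (`ols_fit`, `lts_cost`, `c_steps`, `ga_lts`) and every default (population size, number of generations, mutation rate, subset size) is a hypothetical choice made only for illustration. A candidate in the population is a small basic subset of observation indices, and its fitness is the LTS cost reached after iterating C-steps from that subset.

```python
import random


def ols_fit(points):
    """Closed-form OLS for y = a + b*x on a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx if sxx > 0 else 0.0  # fall back to a flat line if all x are equal
    return my - b * mx, b


def lts_cost(data, coef, h):
    """LTS criterion: sum of the h smallest squared residuals."""
    a, b = coef
    squared = sorted((y - a - b * x) ** 2 for x, y in data)
    return sum(squared[:h])


def c_steps(data, subset, h, max_iter=20):
    """Iterate concentration steps (C-steps) starting from a basic subset of indices."""
    coef = ols_fit([data[i] for i in subset])
    cost = lts_cost(data, coef, h)
    for _ in range(max_iter):
        a, b = coef
        # Keep the h observations with the smallest squared residuals, then refit.
        order = sorted(range(len(data)),
                       key=lambda i: (data[i][1] - a - b * data[i][0]) ** 2)
        new_coef = ols_fit([data[i] for i in order[:h]])
        new_cost = lts_cost(data, new_coef, h)
        if new_cost >= cost - 1e-12:  # no further improvement
            break
        coef, cost = new_coef, new_cost
    return coef, cost


def ga_lts(data, h, pop_size=20, generations=15, subset_size=3, seed=0):
    """Toy genetic search over basic subsets; fitness is the cost after C-steps."""
    rng = random.Random(seed)
    n = len(data)

    def fitness(subset):
        return c_steps(data, subset, h)[1]

    population = [rng.sample(range(n), subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[:pop_size // 2]  # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            # Crossover: draw the child's indices from the union of both parents.
            pool = sorted(set(p1) | set(p2))
            child = rng.sample(pool, subset_size)
            if rng.random() < 0.2:
                # Mutation: swap one index for an observation not yet in the child.
                k = rng.randrange(subset_size)
                child[k] = rng.choice([i for i in range(n) if i not in child])
            children.append(child)
        population = parents + children
    best = min(population, key=fitness)
    return c_steps(data, best, h)
```

Calling `ga_lts(data, h)` on a list of `(x, y)` pairs returns the fitted `(intercept, slope)` pair and the LTS cost of the best subset found. Because each candidate holds only a few indices, the search space per individual stays small, which is the property the abstract gives as the reason a simple genetic algorithm configuration suffices.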

[1]  José Julio Espina Agulló.  New algorithms for computing the least trimmed squares regression estimator , 2001 .

[2]  Walter Krämer,et al.  Review of Modern applied statistics with S, 4th ed. by W.N. Venables and B.D. Ripley. Springer-Verlag 2002 , 2003 .

[3]  A. Hadi,et al.  BACON: blocked adaptive computationally efficient outlier nominators , 2000 .

[4]  David E. Goldberg,et al.  Genetic Algorithms in Search Optimization and Machine Learning , 1988 .

[5]  Peter J. Rousseeuw,et al.  Robust Regression and Outlier Detection , 2005, Wiley Series in Probability and Statistics.

[6]  John H. Holland,et al.  Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence , 1992 .

[7]  R. K. Ursem.  Multi-objective Optimization using Evolutionary Algorithms , 2009 .

[8]  J. Simonoff,et al.  Procedures for the Identification of Multiple Outliers in Linear Models , 1993 .

[9]  David M. Sebert,et al.  A clustering algorithm for identifying multiple outliers in linear regression , 1998 .

[10]  Christina Gloeckner,et al.  Modern Applied Statistics With S , 2003 .

[11]  F. Kianifard,et al.  Using Recursive Residuals, Calculated on Adaptively-Ordered Observations, to Identify Outliers in Linear Regression , 1989 .

[12]  Peter J. Rousseeuw,et al.  Robust regression and outlier detection , 1987 .

[13]  J. Wisnowski.  Multiple Outliers in Linear Regression: Advances in Detection Methods, Robust Estimation, and Variable Selection , 1999 .

[14]  Ali S. Hadi,et al.  A Re-Weighted Least Squares Method for Robust Regression Estimation , 2006 .

[15]  Peter J. Rousseeuw,et al.  Computing LTS Regression for Large Data Sets , 2005, Data Mining and Knowledge Discovery.

[16]  M. Padberg,et al.  Least trimmed squares regression, least median squares regression, and mathematical programming , 2002 .

[17]  A. C. Atkinson,et al.  Computing least trimmed squares regression with the forward search , 1999, Stat. Comput..

[18]  Mervyn G. Marasinghe.  A Multistage Procedure for Detecting Several Outliers in Linear Regression , 1985 .

[19]  W. W. Muir,et al.  Regression Diagnostics: Identifying Influential Data and Sources of Collinearity , 1980 .

[20]  Madeleine Walker,et al.  Masking unmasked , 2002, The Journal of audiovisual media in medicine.