Filtering Outliers in One Step with Genetic Programming

Outliers are among the most difficult issues in real-world modeling tasks: even a small fraction of outliers can prevent a learning algorithm from fitting a dataset. Robust regression algorithms exist, but they fail once outliers exceed 50% of the data (the breakdown point), and robust regression has received little attention in the Genetic Programming (GP) literature. In this paper we present a method that works as a filter, removing outliers in the target variable (vertical outliers). The algorithm is simple: it uses a randomly generated population of GP trees to determine which target values should be labeled as outliers. The method is also highly efficient. Results show that it can return a clean dataset when contamination reaches 90%, and it may be able to handle even higher levels. Only synthetic univariate benchmarks are used to evaluate the approach, but it should be stressed that no other approach can deal with such high levels of outlier contamination at such a small computational cost.
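
The abstract gives only the high-level idea, so the following is a minimal Python sketch of that idea, not the authors' exact algorithm: score each sample by its median absolute residual across a population of random GP-style trees, then flag high-scoring samples as vertical outliers. The primitive set, tree depth, population size, and the median/MAD thresholding rule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_tree(depth=3):
    """Build a random expression tree over {x, const, +, -, *, protected /}
    and return it as a vectorized function f(x). Depth and primitives are
    illustrative choices, not taken from the paper."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.5:
            return lambda x: x
        c = rng.uniform(-5.0, 5.0)
        return lambda x: np.full_like(x, c)
    left, right = random_tree(depth - 1), random_tree(depth - 1)
    op = rng.integers(4)
    if op == 0:
        return lambda x: left(x) + right(x)
    if op == 1:
        return lambda x: left(x) - right(x)
    if op == 2:
        return lambda x: left(x) * right(x)
    # Protected division: fall back to 1.0 near-zero denominators.
    return lambda x: left(x) / np.where(np.abs(right(x)) < 1e-6, 1.0, right(x))

def flag_outliers(x, y, n_trees=200, k=3.0):
    """Score each sample by its median absolute residual across a random
    population; flag scores beyond median + k*MAD (an assumed rule)."""
    residuals = np.array([np.abs(y - random_tree()(x)) for _ in range(n_trees)])
    scores = np.median(residuals, axis=0)            # per-sample robust score
    med = np.median(scores)
    mad = np.median(np.abs(scores - med)) + 1e-12
    return scores > med + k * mad                    # boolean outlier mask

# Toy usage: y = x**2 with 10% of targets shifted into vertical outliers.
x = np.linspace(-3.0, 3.0, 100)
y = x ** 2
idx = rng.choice(100, size=10, replace=False)
y[idx] += rng.uniform(50, 100, size=10)
mask = flag_outliers(x, y)
print(f"flagged {mask.sum()} samples; true outliers recovered: "
      f"{np.intersect1d(np.flatnonzero(mask), idx).size}/10")
```

The intuition: a vertical outlier sits far from any smooth function of x, so most random trees produce a large residual on it, while a clean sample is moderately fit by at least some trees; the per-sample median across the population therefore separates the two groups without ever training a model.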
