SOAP: Semantic outliers automatic preprocessing

Abstract

Genetic Programming (GP) is an evolutionary algorithm for the automatic generation of symbolic models expressed as syntax trees. GP has been applied successfully in many domains, but most research in this area has not considered the presence of outliers in the training set. Outliers make supervised learning problems difficult, and sometimes impossible, to solve. For instance, robust regression methods have a breakdown point of 50%: they cannot handle data in which more than half of the samples are outliers. This paper studies problems where outlier contamination is high, reaching levels of up to 90%, extreme cases that can appear in some domains. This work shows, for the first time, that a random population of GP individuals can detect outliers in the output variable. Building on this property, a new filtering algorithm called Semantic Outlier Automatic Preprocessing (SOAP) is proposed, which can be used with any learning algorithm to differentiate between inliers and outliers. Because the method uses a GP population, it can be carried out at no additional cost within a GP symbolic regression system. The approach is the only method that can perform such an automatic cleaning of a dataset without incurring an exponential cost as the percentage of outliers in the dataset increases.
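To make the exponential-cost remark concrete, the sketch below uses the standard trial-count estimate for random-sampling consensus methods such as RANSAC, chosen here only as an illustration; it is not SOAP and is not taken from the paper. The function name `ransac_trials`, the sample size of 10, and the 0.99 confidence level are assumptions made for this example. The estimate is the number of random subsets that must be drawn so that, with the requested confidence, at least one subset contains only inliers.

```python
import math


def ransac_trials(outlier_fraction, sample_size, confidence=0.99):
    """Estimate how many random subsets a sampling-based method must draw.

    Illustrative helper (hypothetical name). Uses the textbook estimate
    N = log(1 - confidence) / log(1 - (1 - outlier_fraction) ** sample_size).
    """
    if not 0.0 <= outlier_fraction < 1.0:
        raise ValueError("outlier_fraction must be in [0, 1)")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")

    # Probability that a single random subset contains only inliers.
    all_inlier_prob = (1.0 - outlier_fraction) ** sample_size
    if all_inlier_prob >= 1.0:
        return 1  # no outliers: the first subset is already clean

    # log1p keeps the denominator accurate when all_inlier_prob is tiny.
    trials = math.log(1.0 - confidence) / math.log1p(-all_inlier_prob)
    return math.ceil(trials)


if __name__ == "__main__":
    # Assumed sample size of 10 points per subset, for illustration only.
    for contamination in (0.1, 0.3, 0.5, 0.7, 0.9):
        n = ransac_trials(contamination, sample_size=10)
        print(f"contamination={contamination:.0%}  required trials={n}")
```

Running the script shows the required number of trials rising steeply as contamination approaches 90%, which is the kind of cost growth the abstract says a filtering approach should avoid.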
