A semi-supervised Genetic Programming method for dealing with noisy labels and hidden overfitting

Abstract: Data gathered in the real world normally contains noise, stemming either from inaccurate experimental measurements or from human error. Our work deals with classification data in which the attribute values were accurately measured but the class labels of several sample points may have been assigned incorrectly by a human, resulting in unreliable training data. Genetic Programming (GP) compares favorably with the Classification and Regression Trees (CART) method, but it is still highly affected by these errors. Although the evolved models consistently achieve high accuracy on both the training and test sets, many classification errors are found in a later validation phase, revealing a previously hidden overfitting to the erroneous data. Furthermore, the evolved models frequently output raw values that are far from the expected range. To improve the behavior of the evolved models, we extend the original training set with additional sample points whose class label is unknown, and devise a simple way for GP to use this additional information and learn in a semi-supervised manner. With exactly the same mislabeling errors present, the additional unlabeled data allowed GP to evolve models that also achieved high accuracy in the validation phase. This new approach to semi-supervised learning opens an array of possibilities for making the most of the abundance of unlabeled data available today, in a simple and inexpensive way.
