Population-based Ensemble Learning with Tree Structures for Classification

Ensemble learning is one of the most powerful techniques for improving upon individual machine learning models. Rather than relying on a single model, several models are trained and their predictions combined to make a more informed decision. Ideally, such a combination overcomes the shortcomings of any individual member of the ensemble. Most machine learning competition winners feature an ensemble of some sort, and there is also sound theoretical support for the performance of certain ensembling schemes. The benefits of ensembling are therefore clear in both theory and practice. Despite this strong performance, ensemble learning is not a trivial task. One of the main difficulties is designing appropriate ensembles: how large should an ensemble be? Which members should be included? How should these members be weighted? Our first contribution addresses these concerns using a strongly-typed population-based search (genetic programming) to construct well-performing ensembles, where the entire ensemble (members, hyperparameters, structure) is learnt automatically. The proposed method was found, in general, to be significantly better than all base members and all commonly used comparison methods trialled.

Automatically designed ensembles have a range of applications, such as competition entries, forecasting and state-of-the-art prediction. However, these applications often also require additional preprocessing of the input data. The ensemble above considers only the original training data, whereas many machine learning scenarios require a pipeline (for example, performing feature selection before classification). For the second contribution, a novel automated machine learning method based on ensemble learning is proposed. This method uses a random population-based search over appropriate tree structures, and as such is embarrassingly parallel, an important consideration for automated machine learning.
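To make the tree-structured representation concrete, the following is a minimal sketch (not the thesis's actual implementation) of how a strongly-typed ensemble can be encoded as a tree: internal nodes are combiners (here, majority-vote nodes), leaves are base learners with hyperparameters, and the type constraints guarantee that random generation always produces a valid ensemble. The learner names and tuple encoding are invented for illustration.

```python
import random

# Leaves are (tag, learner-name, hyperparameters); internal nodes are
# ("vote", [children]). Strong typing means a vote node may only hold
# prediction-producing children, so any randomly grown tree is valid.
BASE_LEARNERS = ["knn", "decision_tree", "naive_bayes", "svm"]  # assumed names

def random_leaf(rng):
    # A base learner with a randomly chosen hyperparameter setting.
    return ("learner", rng.choice(BASE_LEARNERS), {"seed": rng.randrange(100)})

def random_tree(rng, depth=2):
    # Grow a full ensemble tree: combiners down to `depth`, leaves below.
    if depth == 0:
        return random_leaf(rng)
    return ("vote", [random_tree(rng, depth - 1) for _ in range(3)])

def size(node):
    # Total node count: one per combiner plus one per base learner.
    if node[0] == "learner":
        return 1
    return 1 + sum(size(child) for child in node[1])

rng = random.Random(0)
tree = random_tree(rng, depth=2)
print(size(tree))  # 13: a root vote, 3 inner votes, 9 base learners
```

A population-based search would then mutate and recombine such trees (swapping subtrees, perturbing hyperparameters) and select by validation accuracy.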
The proposed automated machine learning method achieves results equivalent to or better than the current state-of-the-art methods, and does so in a fraction of the time (roughly six times faster).

Finally, while complex ensembles offer strong performance, one large limitation is their interpretability. For example, why does a forest of 500 trees predict a particular class for a given instance? Several methods have been proposed to explain the behaviour of complex models such as ensembles, but these approaches tend to suffer from at least one of the following limitations: an overly complex representation, only local applicability, restriction to particular feature types (e.g. categorical only), or restriction to particular algorithms. For our third contribution, a novel model-agnostic method for interpreting complex black-box machine learning models is proposed. The method is based on strongly-typed genetic programming and overcomes the aforementioned limitations. Multi-objective optimisation is used to generate a Pareto frontier of simple, explainable models which approximate the behaviour of far more complex ones. We found the resulting representations are far simpler than those of existing approaches (an important consideration for interpretability) while providing equivalent reconstruction performance.

Overall, this thesis addresses two of the major limitations of existing ensemble learning: the complex construction process, and the black-box models that are often difficult to interpret. A novel application of ensemble learning in the field of automated machine learning is also proposed. All three methods achieved performance at least equivalent to, and in some cases better than, existing methods.
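The multi-objective selection step described above can be sketched as follows: candidate explanation models are scored on complexity (e.g. tree size) and reconstruction error against the black-box model's predictions, and only the non-dominated candidates are kept. This is an illustrative sketch, not the thesis's implementation, and the candidate names and values are invented.

```python
def pareto_frontier(candidates):
    """Keep candidates not dominated on (complexity, error); lower is better."""
    frontier = []
    for c in candidates:
        dominated = any(
            o["complexity"] <= c["complexity"]
            and o["error"] <= c["error"]
            and o != c
            for o in candidates
        )
        if not dominated:
            frontier.append(c)
    return frontier

# Invented candidates: each trades off simplicity against fidelity to
# the black-box model it approximates.
candidates = [
    {"name": "stump",  "complexity": 1,  "error": 0.30},
    {"name": "small",  "complexity": 5,  "error": 0.12},
    {"name": "medium", "complexity": 9,  "error": 0.12},  # dominated by "small"
    {"name": "large",  "complexity": 20, "error": 0.05},
]
print([c["name"] for c in pareto_frontier(candidates)])  # ['stump', 'small', 'large']
```

The frontier itself is the output offered to the user: rather than a single explanation, a spectrum of models from very simple (and less faithful) to more complex (and more faithful).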
