Evolutionary Machine Learning for Classification with Incomplete Data

Classification is a major task in machine learning and data mining. Many real-world datasets suffer from the unavoidable issue of missing values. Classification with incomplete data has to be handled carefully because inadequate treatment of missing values causes large classification errors. Most existing research on classification with incomplete data has focused on improving effectiveness, but has not adequately addressed the efficiency of applying classifiers to unseen instances, which is much more important than the efficiency of creating them. A common approach to classification with incomplete data is to use imputation methods to replace missing values with plausible values before building classifiers and classifying unseen instances. This approach provides complete data that can then be used by any classification algorithm, but sophisticated imputation methods are usually computationally intensive, especially in the application process of classification. Another approach is to build a classifier that can work directly with missing values. This approach does not require time for estimating missing values, but it often generates inaccurate and complex classifiers when faced with numerous missing values. A recent approach, which also avoids estimating missing values, is to build a set of classifiers and then select from it the classifiers applicable to each unseen instance. However, this approach is also often inaccurate and takes a long time to find applicable classifiers when faced with numerous missing values. The overall goal of the thesis is to simultaneously improve the effectiveness and efficiency of classification with incomplete data by using evolutionary machine learning techniques for feature selection, clustering, ensemble learning, feature construction and constructing classifiers.

The thesis develops approaches for improving imputation for classification with incomplete data by integrating clustering and feature selection with imputation. The approaches improve both the effectiveness and the efficiency of using imputation for classification with incomplete data. The thesis develops wrapper-based feature selection methods to improve the input space for classification algorithms that are able to work directly with incomplete data. The methods not only improve the classification accuracy, but also reduce the complexity of these classifiers. The thesis develops a feature construction method to improve the input space for classification algorithms with incomplete data by proposing interval genetic programming (genetic programming with a set of interval functions). The method improves the classification accuracy and reduces the complexity of classifiers. The thesis develops an ensemble approach to classification with incomplete data by integrating imputation, feature selection and ensemble learning. The results show that the approach is more accurate and faster than previous common methods for classification with incomplete data. The thesis develops interval genetic programming to directly evolve classifiers for incomplete data. The results show that classifiers generated by interval genetic programming can be more effective and efficient than classifiers generated by the combination of imputation and traditional genetic programming. Interval genetic programming is also more effective than common classification algorithms that are able to work directly with incomplete data. In summary, the thesis develops a range of approaches for simultaneously improving the effectiveness and efficiency of classification with incomplete data by using a range of evolutionary machine learning techniques.
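
As a rough illustration of how a program built from interval functions could process an instance without first imputing its missing values, the following minimal Python sketch may help. It is not the thesis's implementation. It assumes that a missing feature value is represented by the interval spanning that feature's observed range, that known values become degenerate intervals, and that the class is decided from the midpoint of the output interval. The names `to_interval`, `example_program` and `classify`, the expression `x0 * x1 - x2`, and the feature ranges are all hypothetical.

```python
import math
from typing import List, Optional, Tuple

Interval = Tuple[float, float]

def to_interval(value: Optional[float], lo: float, hi: float) -> Interval:
    """Known value -> degenerate interval; missing (None or NaN) -> [lo, hi].

    Assumption: lo and hi are the feature's observed minimum and maximum.
    """
    if value is None or math.isnan(value):
        return (lo, hi)
    return (value, value)

# Standard interval arithmetic operators.
def i_add(a: Interval, b: Interval) -> Interval:
    return (a[0] + b[0], a[1] + b[1])

def i_sub(a: Interval, b: Interval) -> Interval:
    return (a[0] - b[1], a[1] - b[0])

def i_mul(a: Interval, b: Interval) -> Interval:
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))

def example_program(x: List[Interval]) -> Interval:
    """Hypothetical stand-in for an evolved expression: x0 * x1 - x2."""
    return i_sub(i_mul(x[0], x[1]), x[2])

def classify(instance: List[Optional[float]],
             ranges: List[Tuple[float, float]]) -> int:
    """Assumed decision rule: class 1 if the output midpoint is >= 0, else 0."""
    x = [to_interval(v, lo, hi) for v, (lo, hi) in zip(instance, ranges)]
    lo, hi = example_program(x)
    return 1 if (lo + hi) / 2.0 >= 0 else 0

if __name__ == "__main__":
    ranges = [(0.0, 2.0), (0.0, 4.0), (0.0, 3.0)]  # hypothetical feature ranges
    print(classify([1.0, None, 1.0], ranges))  # second feature is missing
    print(classify([1.0, 0.5, 1.0], ranges))   # complete instance
```

In this sketch, applying the program to an instance costs a single pass over the expression whether or not values are missing, which is the kind of application-time efficiency the abstract emphasises.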
