BEST: a decision tree algorithm that handles missing values

The main contribution of this paper is a new decision tree algorithm that lets users guide the data partitioning process. We believe this feature has many applications, but in this paper we demonstrate how to use it to analyse data sets containing missing values. We tested the algorithm on simulated data sets with various missing-data structures and on a real data set. The results show that this classification procedure handles missing values efficiently and, without any imputation or pre-processing, produces results that are slightly more accurate and more interpretable than those of most common procedures.
