Approximating XGBoost with an interpretable decision tree

Abstract The increasing use of machine-learning models in critical domains has recently stressed the necessity of interpretable machine-learning models. In areas like healthcare and finance, the model consumer must understand the rationale behind the model's output in order to use it when making a decision. For this reason, black-box models cannot be used in these scenarios, regardless of their high predictive performance. Decision forests, and in particular Gradient Boosting Decision Trees (GBDT), are examples of this kind of model. GBDT models are considered the state of the art in many classification challenges, as reflected by the fact that the majority of recent Kaggle winners have used GBDT methods (such as XGBoost) as part of their solutions. Despite their superior predictive performance, however, they cannot be used in tasks that require transparency. This paper presents a novel method for transforming a decision forest of any kind into an interpretable decision tree. The method extends the tool set available to machine-learning practitioners who want to exploit the interpretability of decision trees without significantly impairing the predictive performance gained by GBDT models like XGBoost. We show in an empirical evaluation that in some cases the generated tree is able to approximate the predictive performance of an XGBoost model while enabling better transparency of the outputs.
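The general idea the abstract describes can be illustrated with a minimal surrogate-model sketch: train a GBDT ensemble, then fit a shallow decision tree to the ensemble's own predictions so that the tree mimics its decision function. Note this is a generic distillation baseline for illustration only, not the paper's specific transformation method, and scikit-learn's `GradientBoostingClassifier` stands in here for XGBoost.

```python
# Hedged sketch: approximate a black-box GBDT with a single shallow tree.
# scikit-learn's GradientBoostingClassifier is used in place of XGBoost;
# the dataset and hyperparameters are arbitrary illustrative choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The black-box ensemble whose behavior we want to make transparent.
gbdt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Fit a depth-limited surrogate tree on the ensemble's *predictions*
# (not the true labels), so the tree imitates the black-box model.
surrogate = DecisionTreeClassifier(max_depth=5, random_state=0)
surrogate.fit(X_train, gbdt.predict(X_train))

# Fidelity: how often the interpretable tree agrees with the ensemble.
fidelity = (surrogate.predict(X_test) == gbdt.predict(X_test)).mean()
print(f"GBDT accuracy:      {gbdt.score(X_test, y_test):.3f}")
print(f"Surrogate accuracy: {surrogate.score(X_test, y_test):.3f}")
print(f"Fidelity to GBDT:   {fidelity:.3f}")
```

The surrogate tree can then be inspected directly (e.g. with `sklearn.tree.plot_tree`), which is the kind of transparency the abstract argues black-box ensembles lack.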
