Finding Influential Training Samples for Gradient Boosted Decision Trees

We address the problem of finding influential training samples for tree ensemble models such as Random Forest (RF) and Gradient Boosted Decision Trees (GBDT). A natural way to formalize this problem is to study how the model's predictions change under leave-one-out retraining, i.e., retraining with each individual training sample removed in turn. Recent work has shown that, for parametric models, this analysis can be carried out in a computationally efficient way. We propose several ways of extending this framework to non-parametric GBDT ensembles under the assumption that the tree structures remain fixed. Furthermore, we introduce a general scheme for obtaining further approximations to our method that balance the trade-off between quality and computational cost. We evaluate our approaches across a variety of experimental setups and use-case scenarios, demonstrating both that our method identifies influential training samples more reliably than the baselines and that it is computationally efficient.
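To make the fixed-structure assumption concrete, the sketch below illustrates it in the simplest possible setting: exact leave-one-out influence for a single regression tree with squared loss. This is a simplification we add for illustration, not the paper's full method; the function name and interface are our own. With squared loss, a leaf's value is the mean of the training targets routed to it, so removing one sample only shifts the value of that sample's own leaf, and the shift can be computed in O(1) from per-leaf sufficient statistics (count and sum).

```python
import numpy as np

def loo_influence_fixed_tree(leaf_ids, y, test_leaf):
    """Leave-one-out influence of each training sample on one test
    prediction, for a single squared-loss regression tree whose
    structure (the partition into leaves) is held fixed.

    leaf_ids  : int array (n,) -- leaf index of each training sample
    y         : float array (n,) -- training targets (or residuals)
    test_leaf : int -- leaf that the test point falls into

    Returns an array (n,) where entry i is the change in the test
    prediction caused by removing training sample i.
    """
    n = len(y)
    influence = np.zeros(n)
    # Per-leaf sufficient statistics: sample count and target sum.
    counts = np.bincount(leaf_ids)
    sums = np.bincount(leaf_ids, weights=y)
    for i in range(n):
        leaf = leaf_ids[i]
        # Removing i can only matter if it shares the test point's leaf
        # and is not the sole sample defining that leaf's value.
        if leaf != test_leaf or counts[leaf] <= 1:
            continue
        full = sums[leaf] / counts[leaf]                  # original leaf value
        loo = (sums[leaf] - y[i]) / (counts[leaf] - 1)    # value without sample i
        influence[i] = loo - full
    return influence

# Example: six samples in two leaves; the test point lands in leaf 0.
leaf_ids = np.array([0, 0, 0, 1, 1, 1])
y = np.array([1.0, 2.0, 6.0, 0.0, 0.0, 3.0])
print(loo_influence_fixed_tree(leaf_ids, y, test_leaf=0))
# -> [ 1.   0.5 -1.5  0.   0.   0. ]
```

The full GBDT case is harder than this one-tree sketch because removing a sample also perturbs the gradients that all subsequent trees are fitted to, so the leaf-value updates must be propagated through the ensemble; that propagation is where the approximations discussed above come in.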
