Finding Influential Training Samples for Gradient Boosted Decision Trees

We address the problem of finding influential training samples for tree ensemble models such as Random Forest (RF) and Gradient Boosted Decision Trees (GBDT). A natural way to formalize this problem is to study how the model's predictions change under leave-one-out retraining, i.e., retraining with each individual training sample removed in turn. Recent work has shown that, for parametric models, this analysis can be carried out in a computationally efficient way. We propose several ways of extending this framework to non-parametric GBDT ensembles under the assumption that the tree structures remain fixed. Furthermore, we introduce a general scheme for obtaining further approximations to our method that balance the trade-off between quality and computational cost. We evaluate our approaches across a variety of experimental setups and use-case scenarios, demonstrating both that our method identifies influential training samples more reliably than the baselines and that it is computationally efficient.
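To make the fixed-structure assumption concrete, the sketch below illustrates it in the simplest possible setting: exact leave-one-out influence for a single regression tree with squared loss. This is a simplification we add for illustration, not the paper's full method; the function name and interface are our own. With squared loss, a leaf's value is the mean of the training targets routed to it, so removing one sample only shifts the value of that sample's own leaf, and the shift can be computed in O(1) from per-leaf sufficient statistics (count and sum).

```python
import numpy as np

def loo_influence_fixed_tree(leaf_ids, y, test_leaf):
    """Leave-one-out influence of each training sample on one test
    prediction, for a single squared-loss regression tree whose
    structure (the partition into leaves) is held fixed.

    leaf_ids  : int array (n,) -- leaf index of each training sample
    y         : float array (n,) -- training targets (or residuals)
    test_leaf : int -- leaf that the test point falls into

    Returns an array (n,) where entry i is the change in the test
    prediction caused by removing training sample i.
    """
    n = len(y)
    influence = np.zeros(n)
    # Per-leaf sufficient statistics: sample count and target sum.
    counts = np.bincount(leaf_ids)
    sums = np.bincount(leaf_ids, weights=y)
    for i in range(n):
        leaf = leaf_ids[i]
        # Removing i can only matter if it shares the test point's leaf
        # and is not the sole sample defining that leaf's value.
        if leaf != test_leaf or counts[leaf] <= 1:
            continue
        full = sums[leaf] / counts[leaf]                  # original leaf value
        loo = (sums[leaf] - y[i]) / (counts[leaf] - 1)    # value without sample i
        influence[i] = loo - full
    return influence

# Example: six samples in two leaves; the test point lands in leaf 0.
leaf_ids = np.array([0, 0, 0, 1, 1, 1])
y = np.array([1.0, 2.0, 6.0, 0.0, 0.0, 3.0])
print(loo_influence_fixed_tree(leaf_ids, y, test_leaf=0))
# -> [ 1.   0.5 -1.5  0.   0.   0. ]
```

The full GBDT case is harder than this one-tree sketch because removing a sample also perturbs the gradients that all subsequent trees are fitted to, so the leaf-value updates must be propagated through the ensemble; that propagation is where the approximations discussed above come in.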
