Influence Measures for CART Classification Trees

This paper addresses the measurement of the influence of individual observations on the results obtained with CART classification trees. Building on influence measures, we propose criteria that quantify the sensitivity of a CART classification tree analysis to each observation. The proposals are prediction-based and rely on jackknife trees, i.e. trees grown after deleting a single observation. The analysis is extended to the pruned sequences of CART trees, yielding CART-specific notions of influence. Within the framework of influence functions, distributional results are derived. A numerical example, the well-known spam dataset, illustrates the notions developed throughout the paper. Finally, a real dataset relating the administrative classification of cities surrounding Paris, France, to the characteristics of their tax revenue distributions is analyzed with the new influence-based tools.
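
The abstract does not spell out the exact form of the criteria, so the following is only a minimal sketch of the general idea, assuming a prediction-based jackknife comparison: the influence of an observation is taken here as the disagreement rate, over the full sample, between the tree grown on all data and the jackknife tree grown with that observation removed. The function name `jackknife_influence`, the use of scikit-learn's `DecisionTreeClassifier` as the CART implementation, and the choice of dataset are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier


def jackknife_influence(X, y, **tree_kwargs):
    """Prediction-based jackknife influence scores (illustrative sketch).

    For each observation i, a "jackknife tree" is grown on the data with i
    removed, and its predictions on the full sample are compared with those
    of the reference tree grown on all observations. The returned score is
    the proportion of predictions that change when i is deleted.
    """
    n = len(y)
    reference = DecisionTreeClassifier(random_state=0, **tree_kwargs).fit(X, y)
    ref_pred = reference.predict(X)

    scores = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        jack = DecisionTreeClassifier(random_state=0, **tree_kwargs).fit(X[keep], y[keep])
        # Disagreement rate with the reference tree on the full sample:
        # a crude, prediction-based notion of the influence of observation i.
        scores[i] = np.mean(jack.predict(X) != ref_pred)
    return scores


if __name__ == "__main__":
    X, y = load_breast_cancer(return_X_y=True)
    X, y = X[:200], y[:200]   # small subsample to keep the n jackknife fits cheap
    infl = jackknife_influence(X, y, max_depth=4)
    print("five most influential observations:", np.argsort(infl)[-5:])
```

A pruning-aware variant in the spirit of the paper would repeat this comparison along the sequence of cost-complexity pruned subtrees (e.g. via scikit-learn's `cost_complexity_pruning_path`) rather than for a single tree.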
