Efficient computation and analysis of distributional Shapley values

The distributional data Shapley value (DShapley) has recently been proposed as a principled framework for quantifying the contribution of individual data points in machine learning. DShapley extends the foundational game-theoretic concept of the Shapley value into a statistical framework and can be applied to identify data points that are useful (or harmful) to a learning algorithm. Estimating DShapley is computationally expensive, however, which can be a major obstacle to using it in practice. Moreover, there has been little mathematical analysis of how this value depends on data characteristics. In this paper, we derive the first analytic expressions for DShapley for the canonical problems of linear regression and non-parametric density estimation. These analytic forms yield new algorithms for computing DShapley that are several orders of magnitude faster than the previous state of the art. Furthermore, our formulas are directly interpretable and provide quantitative insights into how the value varies across different types of data. We demonstrate the efficacy of our DShapley approach on multiple real and synthetic datasets.
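
To make the computational cost concrete, the following is a minimal sketch of a brute-force Monte Carlo estimator for linear regression. It is not the analytic method derived in the paper. It assumes that DShapley at a fixed cardinality m is the expected gain in utility from adding a point to m-1 i.i.d. draws from the data distribution. The utility (negative test mean squared error of a least-squares fit), the helper names `utility` and `mc_dshapley`, and the synthetic data are all illustrative assumptions.

```python
import numpy as np


def utility(X_train, y_train, X_test, y_test):
    """Negative test MSE of a least-squares fit (an assumed utility choice)."""
    if len(X_train) == 0:
        pred = np.zeros(len(y_test))  # empty training set: predict zero
    else:
        w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
        pred = X_test @ w
    return -np.mean((pred - y_test) ** 2)


def mc_dshapley(z_x, z_y, X_pool, y_pool, X_test, y_test, m, n_draws=200, seed=0):
    """Monte Carlo estimate of the value of point (z_x, z_y) at cardinality m.

    Resampling the pool with replacement stands in for i.i.d. draws from the
    underlying data distribution (an approximation).
    """
    rng = np.random.default_rng(seed)
    gains = []
    for _ in range(n_draws):
        idx = rng.choice(len(X_pool), size=m - 1, replace=True)
        X_s, y_s = X_pool[idx], y_pool[idx]
        base = utility(X_s, y_s, X_test, y_test)
        with_z = utility(np.vstack([X_s, z_x]), np.append(y_s, z_y), X_test, y_test)
        gains.append(with_z - base)  # marginal contribution of the point
    return float(np.mean(gains))


# Synthetic linear data (hypothetical): y = x @ w_true + small noise.
rng = np.random.default_rng(42)
w_true = np.array([2.0, -1.0, 0.5])
X_pool = rng.normal(size=(500, 3))
y_pool = X_pool @ w_true + 0.1 * rng.normal(size=500)
X_test = rng.normal(size=(200, 3))
y_test = X_test @ w_true + 0.1 * rng.normal(size=200)

z_x = np.array([1.0, -1.0, 0.5])
v_clean = mc_dshapley(z_x, z_x @ w_true, X_pool, y_pool, X_test, y_test, m=20)
v_bad = mc_dshapley(z_x, z_x @ w_true + 50.0, X_pool, y_pool, X_test, y_test, m=20)
print(f"clean point: {v_clean:.4f}, corrupted label: {v_bad:.4f}")
```

Each estimate needs two model fits per draw, for every point and every cardinality of interest. A closed-form expression removes that cost. In this sketch the point with the corrupted label receives a clearly negative value, which matches the abstract's description of identifying harmful data.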
