Rewarding High-Quality Data via Influence Functions

We consider a crowdsourced data acquisition scenario, such as federated learning, in which a Center collects data points from a set of rational Agents in order to train a model. For linear regression models, we show how to design a payment structure that incentivizes the agents to provide high-quality data as early as possible, based on a characterization of the influence that data points have on the loss function of the model. Our contributions can be summarized as follows: (a) we prove theoretically that this scheme ensures truthful data reporting as a game-theoretic equilibrium, and further demonstrate its robustness against mixtures of truthful and heuristic data reports; (b) we design a procedure by which the influence computation can be efficiently approximated and processed sequentially in batches over time; (c) we develop a theory that allows correcting the difference between the influence and the overall change in loss; and (d) we evaluate our approach on real datasets, confirming our theoretical findings.
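To make the notion of influence concrete, the following is a minimal sketch for ordinary least squares. It assumes the standard first-order influence approximation (test-loss gradient, inverse Hessian, gradient of the new point's loss), not necessarily the exact scheme used in the paper. All function names, the synthetic data, and the noise levels are illustrative assumptions. The sketch compares the influence estimate for a newly reported point with the actual change in test loss after retraining, which is the gap contribution (c) refers to.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit; returns the parameters and the Hessian X^T X
    of the summed squared loss 0.5 * sum((X @ theta - y) ** 2)."""
    H = X.T @ X
    theta = np.linalg.solve(H, X.T @ y)
    return theta, H


def mean_squared_loss(theta, X_eval, y_eval):
    """Mean of 0.5 * residual^2 on an evaluation set."""
    r = X_eval @ theta - y_eval
    return 0.5 * float(np.mean(r ** 2))


def influence_of_new_point(x, y, theta, H, X_eval, y_eval):
    """First-order estimate of the evaluation-loss change caused by
    adding the point (x, y) to the training set (assumed formula:
    -grad_eval^T H^{-1} grad_point)."""
    grad_eval = X_eval.T @ (X_eval @ theta - y_eval) / len(y_eval)
    grad_point = x * (x @ theta - y)
    return float(-grad_eval @ np.linalg.solve(H, grad_point))


def actual_change(x, y, X, y_train, X_eval, y_eval):
    """Exact evaluation-loss change obtained by retraining with (x, y)."""
    theta_old, _ = fit_ols(X, y_train)
    theta_new, _ = fit_ols(np.vstack([X, x]), np.append(y_train, y))
    return (mean_squared_loss(theta_new, X_eval, y_eval)
            - mean_squared_loss(theta_old, X_eval, y_eval))


if __name__ == "__main__":
    # Synthetic data; the true parameters and noise scale are assumptions.
    rng = np.random.default_rng(0)
    true_theta = np.array([1.0, -2.0, 0.5])
    X = rng.normal(size=(200, 3))
    y_train = X @ true_theta + 0.1 * rng.normal(size=200)
    X_eval = rng.normal(size=(100, 3))
    y_eval = X_eval @ true_theta + 0.1 * rng.normal(size=100)
    theta, H = fit_ols(X, y_train)

    x_new = rng.normal(size=3)
    reports = {
        "truthful": float(x_new @ true_theta),         # label on the true model
        "heuristic": float(x_new @ true_theta) + 5.0,  # heavily shifted label
    }
    for name, y_new in reports.items():
        est = influence_of_new_point(x_new, y_new, theta, H, X_eval, y_eval)
        act = actual_change(x_new, y_new, X, y_train, X_eval, y_eval)
        print(f"{name}: influence estimate {est:+.3e}, actual change {act:+.3e}")
```

A negative value means the point is estimated to reduce the evaluation loss. A payment rule could reward such points, but how influence is mapped to payments is not specified in this abstract, so the sketch stops at the influence values.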