Certified Computation from Unreliable Datasets

A wide range of learning tasks require human input to label massive data sets. The collected data, however, are usually of low quality and contain inaccuracies and errors. As a result, modern science and business face the problem of learning from unreliable data sets. In this work, we provide a generic approach based on \textit{verification} of only a few records of the data set to guarantee high-quality learning outcomes for various optimization objectives. Our method identifies small sets of critical records and verifies their validity. We show that many problems need only $\text{poly}(1/\varepsilon)$ verifications to ensure that the output of the computation is within a factor of $(1 \pm \varepsilon)$ of the truth. For any given instance, we provide an \textit{instance optimal} solution that verifies the minimum possible number of records needed to approximately certify correctness. Using this instance optimal formulation of the problem, we then prove our main result: "every function that satisfies some Lipschitz continuity condition can be certified with a small number of verifications". We show that the required Lipschitz continuity condition is satisfied even by some NP-complete problems, which illustrates the generality and importance of this theorem. If the certification step fails, an invalid record is identified. Removing such records and repeating until certification succeeds guarantees that the result is accurate and depends only on the verified records. Surprisingly, as we show, more efficient methods are possible for several computation tasks. These methods always guarantee that the produced result is not affected by the invalid records, since any invalid record that affects the output will be detected and verified.
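
The verify, remove, and repeat loop described above can be illustrated on a toy task. The following is a minimal sketch, not the paper's algorithm: it assumes the computation is the maximum of a list of numbers and that a verification oracle is available as a callable. The names `certified_max` and `is_valid` are hypothetical and chosen only for this illustration.

```python
from typing import Callable, Sequence


def certified_max(
    records: Sequence[float],
    is_valid: Callable[[int], bool],  # hypothetical verification oracle
) -> tuple[float, int]:
    """Return (certified maximum, number of verifications used).

    Toy illustration only: for the maximum, the single record that
    determines the output is the current argmax, so it is the only
    record that has to be verified in each round.
    """
    alive = set(range(len(records)))
    verifications = 0
    while alive:
        # Critical record: the one the current output depends on.
        i = max(alive, key=lambda j: records[j])
        verifications += 1
        if is_valid(i):
            # Certification succeeded: the output rests on a verified record.
            return records[i], verifications
        # Certification failed: an invalid record was found; drop it and repeat.
        alive.remove(i)
    raise ValueError("no valid record found")


# Assumed example data: record 1 (value 9.0) is invalid.
records = [3.0, 9.0, 7.0, 5.0]
invalid = {1}
value, used = certified_max(records, lambda i: i not in invalid)
```

In this sketch, invalid records that lie below the verified maximum are never inspected, because they cannot affect the output; only the records that would change the result are checked.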
