We consider a model of unreliable or crowdsourced data where there is an underlying set of $n$ binary variables, each evaluator contributes a (possibly unreliable or adversarial) estimate of the values of some subset of $r$ of the variables, and the learner is given the true value of a constant number of variables. We show that, provided an $\alpha$-fraction of the evaluators are "good" (either correct, or with independent noise rate $p < 1/2$), the true values of a $(1-\epsilon)$ fraction of the $n$ underlying variables can be recovered as long as $\alpha > 1/(2-2p)^r$. This setting can be viewed as an instance of the semi-verified learning model introduced in [CSV17], which explores the tradeoff between the number of items evaluated by each worker and the fraction of good evaluators. Our results require the number of evaluators to be extremely large, $> n^r$, although our algorithm runs in linear time, $O_{r,\epsilon}(n)$, given query access to the large dataset of evaluations. This setting and results can also be viewed as examining a general class of semi-adversarial CSPs with a planted assignment.
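To make the recovery threshold concrete, consider the illustrative choice (the specific values are ours, not taken from the text above) of $r = 10$ items per evaluator and noiseless good evaluators, i.e. $p = 0$. Instantiating the bound $\alpha > 1/(2-2p)^r$ gives
$$\alpha > \frac{1}{(2 - 2\cdot 0)^{10}} = \frac{1}{2^{10}} = \frac{1}{1024} \approx 0.001,$$
so under these assumptions accurate recovery is possible even when only about one in a thousand evaluators is reliable.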
This parameter regime, in which the fraction of reliable data is small, is relevant to a number of practical settings. Consider, for example, a setting where one has a large dataset of customer preferences, with each customer specifying preferences for a small (constant) number of items, and the goal is to ascertain the preferences of a specific demographic of interest. Our results show that this large dataset (which lacks demographic information) can be leveraged, together with the preferences of the demographic of interest for a constant number of randomly selected items, to recover an accurate estimate of the entire set of preferences. In this sense, our results can be viewed as a "data prism" allowing one to extract the behavior of specific cohorts from a large, mixed dataset.
[1] Gregory Valiant, et al. Avoiding Imposters and Delinquents: Adversarial Crowdsourcing and Peer Prediction. NIPS, 2016.
[2] Santosh S. Vempala, et al. Agnostic Estimation of Mean and Covariance. FOCS, 2016.
[3] Daniel M. Kane, et al. Robust Estimators in High Dimensions without the Computational Intractability. FOCS, 2016.
[4] Gregory Valiant, et al. Learning from untrusted data. STOC, 2016.
[5] Santosh S. Vempala, et al. On the Complexity of Random Satisfiability Problems with Planted Solutions. 2018.
[6] Prasad Raghavendra, et al. Strongly refuting random CSPs below the spectral threshold. STOC, 2016.
[7] Emmanuel J. Candès, et al. Matrix Completion With Noise. Proceedings of the IEEE, 2009.
[8] Andrea Montanari, et al. Matrix completion from a few entries. ISIT, 2009.
[9] John Law, et al. Robust Statistics: The Approach Based on Influence Functions. 1986.
[10] Prateek Jain, et al. Robust Regression via Hard Thresholding. NIPS, 2015.
[11] David Haussler, et al. Sphere Packing Numbers for Subsets of the Boolean n-Cube with Bounded Vapnik-Chervonenkis Dimension. J. Comb. Theory, Ser. A, 1995.