We consider a model of unreliable or crowdsourced data where there is an underlying set of $n$ binary variables, each evaluator contributes a (possibly unreliable or adversarial) estimate of the values of some subset of $r$ of the variables, and the learner is given the true value of a constant number of variables. We show that, provided an $\alpha$-fraction of the evaluators are "good" (either correct, or with independent noise rate $p < 1/2$), the true values of a $(1-\epsilon)$ fraction of the $n$ underlying variables can be recovered as long as $\alpha > 1/(2-2p)^r$. This setting can be viewed as an instance of the semi-verified learning model introduced in [CSV17], which explores the tradeoff between the number of items evaluated by each worker and the fraction of good evaluators. Our results require the number of evaluators to be extremely large, $> n^r$, although our algorithm runs in linear time, $O_{r,\epsilon}(n)$, given query access to the large dataset of evaluations. This setting and results can also be viewed as examining a general class of semi-adversarial CSPs with a planted assignment.
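To make the recovery threshold concrete, consider the illustrative choice (the specific values are ours, not taken from the text above) of $r = 10$ items per evaluator and noiseless good evaluators, i.e. $p = 0$. Instantiating the bound $\alpha > 1/(2-2p)^r$ gives
$$\alpha > \frac{1}{(2 - 2\cdot 0)^{10}} = \frac{1}{2^{10}} = \frac{1}{1024} \approx 0.001,$$
so under these assumptions accurate recovery is possible even when only about one in a thousand evaluators is reliable.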
This parameter regime, in which the fraction of reliable data is small, is relevant to a number of practical settings. Consider, for example, a setting where one has a large dataset of customer preferences, with each customer specifying preferences for a small (constant) number of items, and the goal is to ascertain the preferences of a specific demographic of interest. Our results show that this large dataset (which lacks demographic information) can be leveraged, together with the preferences of the demographic of interest for a constant number of randomly selected items, to recover an accurate estimate of the entire set of preferences. In this sense, our results can be viewed as a "data prism" allowing one to extract the behavior of specific cohorts from a large, mixed dataset.
[1] Gregory Valiant, et al. Avoiding Imposters and Delinquents: Adversarial Crowdsourcing and Peer Prediction. NIPS, 2016.
[2] Santosh S. Vempala, et al. Agnostic Estimation of Mean and Covariance. FOCS, 2016.
[3] Daniel M. Kane, et al. Robust Estimators in High Dimensions without the Computational Intractability. FOCS, 2016.
[4] Gregory Valiant, et al. Learning from untrusted data. STOC, 2016.
[5] Santosh S. Vempala, et al. On the Complexity of Random Satisfiability Problems with Planted Solutions. 2018.
[6] Prasad Raghavendra, et al. Strongly refuting random CSPs below the spectral threshold. STOC, 2016.
[7] Emmanuel J. Candès, et al. Matrix Completion With Noise. Proceedings of the IEEE, 2009.
[8] Andrea Montanari, et al. Matrix completion from a few entries. ISIT, 2009.
[9] John Law, et al. Robust Statistics: The Approach Based on Influence Functions. 1986.
[10] Prateek Jain, et al. Robust Regression via Hard Thresholding. NIPS, 2015.
[11] David Haussler, et al. Sphere Packing Numbers for Subsets of the Boolean n-Cube with Bounded Vapnik-Chervonenkis Dimension. J. Comb. Theory, Ser. A, 1995.