Learning from untrusted data

The vast majority of theoretical results in machine learning and statistics assume that the training data is a reliable reflection of the phenomena to be learned. Similarly, most learning techniques used in practice are brittle in the presence of large amounts of biased or malicious data. Motivated by this, we consider two frameworks for studying estimation, learning, and optimization in the presence of significant fractions of arbitrary data. The first framework, list-decodable learning, asks whether it is possible to return a list of answers such that at least one is accurate. For example, given a dataset of n points of which an unknown subset of αn points is drawn from a distribution of interest, with no assumptions made about the remaining (1 - α)n points, is it possible to return a list of poly(1/α) answers, at least one of which is close to the truth? The second framework, which we term the semi-verified model, asks whether a small dataset of trusted data (drawn from the distribution in question) can be used to extract accurate information from a much larger but untrusted dataset (of which only an α-fraction is drawn from the distribution). We show strong positive results in both settings, and provide an algorithm for robust learning in a very general stochastic optimization setting. This result has immediate implications for robustly estimating the mean of distributions with bounded second moments, robustly learning mixtures of such distributions, and robustly finding planted partitions in random graphs in which significant portions of the graph have been adversarially perturbed.
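To make the list-decodable setting concrete, here is a minimal sketch of the data model and a naive clustering baseline. This is not the paper's algorithm; the generation parameters (n, d, alpha, true_mean) and the small k-means helper are hypothetical choices for illustration. An α-fraction of the points comes from the distribution of interest, the rest is arbitrary, and the goal is a short list of candidate means, at least one of which should be close to the true mean.

```python
# Illustrative sketch (not the paper's method): generate a dataset where
# only an alpha-fraction is drawn from the distribution of interest, then
# return ~1/alpha cluster means as the candidate list.
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha = 2000, 10, 0.2          # hypothetical parameters
true_mean = np.full(d, 3.0)

n_good = int(alpha * n)
good = rng.normal(loc=true_mean, scale=1.0, size=(n_good, d))
bad = rng.uniform(-20, 20, size=(n - n_good, d))  # stand-in for arbitrary points
X = np.concatenate([good, bad])

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm; centers initialized from random data points."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

# A list of O(1/alpha) candidate answers; report the best one's error.
candidates = kmeans(X, k=int(np.ceil(1 / alpha)))
print(min(np.linalg.norm(c - true_mean) for c in candidates))
```

Note that this naive baseline can fail when the arbitrary points are structured to mimic plausible clusters, which is exactly why the guarantees in the paper require more careful algorithms than vanilla k-means.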
