Robust Mean Estimation under Coordinate-level Corruption

We study the problem of robust mean estimation and introduce a novel Hamming distance-based measure of distribution shift for coordinate-level corruptions. We show that this measure yields adversary models that capture more realistic corruptions than those used in prior work, and we present an information-theoretic analysis of robust mean estimation in these settings. We show that, for structured distributions, methods that leverage the structure yield information-theoretically more accurate mean estimates. We also consider practical algorithms and study when data-cleaning-inspired approaches, which first fix corruptions in the input data and then perform robust mean estimation, can match the information-theoretic bounds of our analysis. Finally, we demonstrate experimentally that this two-step approach outperforms structure-agnostic robust estimation and provides accurate mean estimation even for high-magnitude corruption.
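The two-step idea can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the samples are two-dimensional with the known structure x2 = 2 * x1, corrupted coordinates are flagged as missing (`None`), and the coordinate-wise median stands in for a robust mean estimator (reasonable here only because the toy data are symmetric). The function names `repair`, `coordinatewise_median`, and `two_step_estimate` are hypothetical.

```python
from statistics import median

def repair(sample):
    """Step 1: fill a missing coordinate using the ASSUMED structure x2 = 2 * x1."""
    x1, x2 = sample
    if x1 is None and x2 is not None:
        x1 = x2 / 2
    elif x2 is None and x1 is not None:
        x2 = 2 * x1
    return (x1, x2)

def coordinatewise_median(samples):
    """Step 2 (and the structure-agnostic baseline): per-coordinate median,
    skipping any coordinate that is still missing."""
    dims = range(len(samples[0]))
    return tuple(
        median(s[j] for s in samples if s[j] is not None) for j in dims
    )

def two_step_estimate(samples):
    """Repair coordinates first, then apply the robust estimator."""
    return coordinatewise_median([repair(s) for s in samples])

# Clean samples would be (1, 2), (2, 4), (3, 6), (4, 8), (5, 10), with mean (3, 6).
# A coordinate-level adversary hides mostly the large second coordinates.
corrupted = [(1.0, 2.0), (2.0, None), (None, 6.0), (4.0, None), (5.0, None)]

baseline = coordinatewise_median(corrupted)   # ignores the structure
two_step = two_step_estimate(corrupted)       # uses the structure to repair

print("structure-agnostic:", baseline)
print("two-step:", two_step)
```

In this toy case the structure-agnostic estimate of the second coordinate is biased because only small values remain visible, while repairing from the first coordinate recovers the clean value. The sketch says nothing about the guarantees or algorithms analyzed in the paper.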
