On Robust Mean Estimation under Coordinate-level Corruption

We study the problem of robust mean estimation and introduce a novel Hamming distance-based measure of distribution shift for coordinate-level corruptions. We show that this measure yields adversary models that capture more realistic corruptions than those used in prior work, and we present an information-theoretic analysis of robust mean estimation in these settings. We show that, for structured distributions, methods that leverage the structure yield information-theoretically more accurate mean estimates. We also consider practical algorithms for robust mean estimation and study when data cleaning-inspired approaches, which first fix corruptions in the input data and then perform robust mean estimation, can match the information-theoretic bounds of our analysis. Finally, we demonstrate experimentally that this two-step approach outperforms structure-agnostic robust estimation and provides accurate mean estimation even for high-magnitude corruption.
