On Robust Mean Estimation under Coordinate-level Corruption

We study robust mean estimation and introduce a novel Hamming distance-based measure of distribution shift for coordinate-level corruptions. We show that this measure yields adversary models that capture more realistic corruptions than those used in prior work, and we present an information-theoretic analysis of robust mean estimation techniques in these settings. For structured distributions, we show that methods that leverage the structure yield more accurate mean estimates. Finally, we introduce a novel two-step meta-algorithm for robust mean estimation that first repairs corruptions in the input data and then performs robust mean estimation on the repaired data. On real-world data with missing values, we demonstrate that this two-step approach outperforms existing robust estimation methods and provides accurate mean estimation even in high-magnitude corruption settings.
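
The two-step idea (repair first, then estimate robustly) can be illustrated with a minimal sketch. The abstract does not specify which repair procedure or which robust estimator the paper uses, so both choices below are assumptions made purely for illustration: missing coordinates (encoded as NaN) are filled with the column median of the observed values, and the mean is then estimated per coordinate with a trimmed mean. The function names `impute_missing`, `trimmed_mean`, and `two_step_mean` are hypothetical.

```python
import math
import statistics


def impute_missing(rows):
    """Step 1 (assumed repair rule): fill NaN entries with the column median
    of the observed values. Raises StatisticsError if a column is all NaN."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        medians.append(statistics.median(observed))
    return [
        [medians[j] if math.isnan(v) else v for j, v in enumerate(r)]
        for r in rows
    ]


def trimmed_mean(values, trim_fraction):
    """Mean after dropping int(trim_fraction * n) values from each tail.
    trim_fraction must be below 0.5 so that some values remain."""
    ordered = sorted(values)
    k = int(trim_fraction * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def two_step_mean(rows, trim_fraction=0.1):
    """Step 2 (assumed estimator): coordinate-wise trimmed mean of the
    repaired data."""
    repaired = impute_missing(rows)
    n_cols = len(repaired[0])
    return [
        trimmed_mean([r[j] for r in repaired], trim_fraction)
        for j in range(n_cols)
    ]


# Toy data: two missing coordinates and one grossly corrupted value (100.0).
nan = float("nan")
data = [
    [1.0, 10.0],
    [2.0, nan],
    [3.0, 30.0],
    [nan, 20.0],
    [100.0, 20.0],
]
estimate = two_step_mean(data, trim_fraction=0.2)
print(estimate)
```

In this toy example the corrupted value 100.0 is trimmed away after the missing entries have been filled, so the estimate stays close to the bulk of the data. A coordinate-wise trimmed mean is only one simple choice of robust estimator; any other robust mean estimator could be substituted in the second step.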
