Quantum Entropy Scoring for Fast Robust Mean Estimation and Improved Outlier Detection

We study two problems in high-dimensional robust statistics: \emph{robust mean estimation} and \emph{outlier detection}. In robust mean estimation, the goal is to estimate the mean $\mu$ of a distribution on $\mathbb{R}^d$ given $n$ independent samples, an $\varepsilon$-fraction of which have been corrupted by a malicious adversary. In outlier detection, the goal is to assign an \emph{outlier score} to each element of a data set so that elements more likely to be outliers receive higher scores. Our algorithms for both problems are based on a new outlier scoring method we call QUE-scoring, which relies on \emph{quantum entropy regularization}. For robust mean estimation, this yields the first algorithm with optimal error rates and nearly-linear running time $\widetilde{O}(nd)$ in all parameters, improving on the previous fastest running time of $\widetilde{O}(\min(nd/\varepsilon^6, nd^2))$. For outlier detection, we evaluate the performance of QUE-scoring in extensive experiments on synthetic and real data, and demonstrate that it often outperforms previously proposed algorithms. Code for these experiments is available at this https URL.
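To make the scoring rule concrete, the following is a minimal Python sketch of QUE-scoring as we understand it from the abstract: each point is scored by a quadratic form in a density matrix $U \propto \exp(\alpha \Sigma)$, where $\Sigma$ is the empirical covariance of the centered data and $\alpha \geq 0$ interpolates between averaging over all directions ($\alpha = 0$) and projecting onto the top eigenvector ($\alpha \to \infty$). The function name `que_scores`, the default value of $\alpha$, and the spectral normalization of $\Sigma$ are illustrative choices of ours, not the authors' reference implementation.

```python
import numpy as np
from scipy.linalg import expm

def que_scores(X, alpha=4.0):
    """Assign an outlier score to each row of X via quantum entropy
    regularization: score_i = (x_i - mu)^T U (x_i - mu), where U is a
    density matrix proportional to exp(alpha * Sigma). (Illustrative
    sketch; normalization choices are assumptions, not the paper's.)"""
    n, d = X.shape
    Xc = X - X.mean(axis=0)                    # center the data
    Sigma = Xc.T @ Xc / n                      # empirical covariance
    Sigma = Sigma / np.linalg.norm(Sigma, 2)   # make alpha scale-free
    U = expm(alpha * Sigma)                    # matrix exponential
    U = U / np.trace(U)                        # density matrix: PSD, trace 1
    # Quadratic form per row: sum_{j,k} Xc[i,j] * U[j,k] * Xc[i,k]
    return np.einsum('ij,jk,ik->i', Xc, U, Xc)
```

Note that forming the matrix exponential explicitly costs $O(d^3)$ time; the nearly-linear $\widetilde{O}(nd)$ running time claimed above presumably comes from approximating these quadratic forms with matrix-exponential-vector products against a small number of random projections rather than materializing $U$. For robust mean estimation, such scores can drive a standard filtering loop: repeatedly remove or downweight the highest-scoring points and re-estimate the mean.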
