Quantum Entropy Scoring for Fast Robust Mean Estimation and Improved Outlier Detection

We study two problems in high-dimensional robust statistics: \emph{robust mean estimation} and \emph{outlier detection}. In robust mean estimation, the goal is to estimate the mean $\mu$ of a distribution on $\mathbb{R}^d$ given $n$ independent samples, an $\varepsilon$-fraction of which have been corrupted by a malicious adversary. In outlier detection, the goal is to assign an \emph{outlier score} to each element of a data set so that elements more likely to be outliers receive higher scores. Our algorithms for both problems are based on a new outlier scoring method we call QUE-scoring, which relies on \emph{quantum entropy regularization}. For robust mean estimation, this yields the first algorithm with optimal error rates and nearly-linear running time $\widetilde{O}(nd)$ in all parameters, improving on the previous fastest running time of $\widetilde{O}(\min(nd/\varepsilon^6, nd^2))$. For outlier detection, we evaluate the performance of QUE-scoring in extensive experiments on synthetic and real data, and demonstrate that it often outperforms previously proposed algorithms. Code for these experiments is available at this https URL.
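To make the scoring rule concrete, the following is a minimal Python sketch of QUE-scoring as we understand it from the abstract: each point is scored by a quadratic form in a density matrix $U \propto \exp(\alpha \Sigma)$, where $\Sigma$ is the empirical covariance of the centered data and $\alpha \geq 0$ interpolates between averaging over all directions ($\alpha = 0$) and projecting onto the top eigenvector ($\alpha \to \infty$). The function name `que_scores`, the default value of $\alpha$, and the spectral normalization of $\Sigma$ are illustrative choices of ours, not the authors' reference implementation.

```python
import numpy as np
from scipy.linalg import expm

def que_scores(X, alpha=4.0):
    """Assign an outlier score to each row of X via quantum entropy
    regularization: score_i = (x_i - mu)^T U (x_i - mu), where U is a
    density matrix proportional to exp(alpha * Sigma). (Illustrative
    sketch; normalization choices are assumptions, not the paper's.)"""
    n, d = X.shape
    Xc = X - X.mean(axis=0)                    # center the data
    Sigma = Xc.T @ Xc / n                      # empirical covariance
    Sigma = Sigma / np.linalg.norm(Sigma, 2)   # make alpha scale-free
    U = expm(alpha * Sigma)                    # matrix exponential
    U = U / np.trace(U)                        # density matrix: PSD, trace 1
    # Quadratic form per row: sum_{j,k} Xc[i,j] * U[j,k] * Xc[i,k]
    return np.einsum('ij,jk,ik->i', Xc, U, Xc)
```

Note that forming the matrix exponential explicitly costs $O(d^3)$ time; the nearly-linear $\widetilde{O}(nd)$ running time claimed above presumably comes from approximating these quadratic forms with matrix-exponential-vector products against a small number of random projections rather than materializing $U$. For robust mean estimation, such scores can drive a standard filtering loop: repeatedly remove or downweight the highest-scoring points and re-estimate the mean.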
