Optimal outlier removal in high-dimensional spaces

We study the problem of finding an outlier-free subset of a set of points (or a probability distribution) in n-dimensional Euclidean space. As in [BFKV 99], a point x is defined to be a β-outlier if there exists some direction w in which its squared distance from the mean along w is greater than β times the average squared distance from the mean along w. Our main theorem is that for any ε > 0, there exists a (1 - ε) fraction of the original distribution that has no O((n/ε)(b + log(n/ε)))-outliers, improving on the previous bound of O(n^7 b/ε). This is asymptotically the best possible, as shown by a matching lower bound. The theorem is constructive, and results in a 1/(1 - ε) approximation to the following optimization problem: given a distribution µ (i.e. the ability to sample from it), and a parameter ε > 0, find the minimum β for which there exists a subset of probability at least (1 - ε) with no β-outliers.
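To make the β-outlier definition concrete, the following is a minimal sketch for a finite point set; it is not the algorithm of this paper. It assumes the empirical covariance matrix Σ is nonsingular. Under that assumption, by the Cauchy-Schwarz inequality, the largest ratio over all directions w of the squared distance of x from the mean along w to the average squared distance along w equals (x - m)ᵀ Σ⁻¹ (x - m), where m is the mean. So x is a β-outlier exactly when this quantity exceeds β. The function names `solve`, `outlier_scores`, and `beta_outliers` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def solve(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            # Assumption of this sketch: the covariance matrix is nonsingular.
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution.
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


def outlier_scores(points: Sequence[Sequence[float]]) -> List[float]:
    """For each point x, return the maximum over directions w of
    (w . (x - mean))^2 / average_y (w . (y - mean))^2,
    which equals (x - mean)^T Sigma^{-1} (x - mean)."""
    k = len(points)
    n = len(points[0])
    mean = [sum(p[i] for p in points) / k for i in range(n)]
    centered = [[p[i] - mean[i] for i in range(n)] for p in points]
    # Empirical covariance: the average (not the unbiased estimate) of the
    # outer products, matching "average squared distance" in the definition.
    cov = [[sum(c[i] * c[j] for c in centered) / k for j in range(n)]
           for i in range(n)]
    scores = []
    for c in centered:
        z = solve(cov, c)  # z = Sigma^{-1} (x - mean)
        scores.append(sum(ci * zi for ci, zi in zip(c, z)))
    return scores


def beta_outliers(points: Sequence[Sequence[float]], beta: float) -> List[int]:
    """Return the indices of the points that are beta-outliers."""
    return [i for i, s in enumerate(outlier_scores(points)) if s > beta]
```

One consequence of this formulation is that the scores of any point set average to exactly n, so every set contains a point with score at least n, and no point set with nonsingular covariance is free of β-outliers for β < n.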