Linear-Time Outlier Detection via Sensitivity

Outliers are ubiquitous in modern data sets. Distance-based techniques are a popular non-parametric approach to outlier detection, as they require no prior assumptions on the data-generating distribution and are simple to implement. Scaling these techniques to massive data sets without sacrificing accuracy is a challenging task. We propose a novel algorithm based on the intuition that outliers have a significant influence on the quality of divergence-based clustering solutions. We propose sensitivity, the worst-case impact of a data point on the clustering objective, as a measure of outlierness. We then prove that influence, a (non-trivial) upper bound on the sensitivity, can be computed by a simple linear-time algorithm. To scale beyond a single machine, we propose a communication-efficient distributed algorithm. In an extensive experimental evaluation, we demonstrate the effectiveness of the proposed approach and establish its statistical significance. In particular, it outperforms the most popular distance-based approaches while being several orders of magnitude faster.
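
To make the idea concrete, the following minimal sketch scores each point by a generic D^2-sampling-based upper bound on its sensitivity for the k-means objective. It is an illustrative approximation under assumed squared-Euclidean costs, not necessarily the paper's exact influence formula; the function name `influence_scores` and the particular bound used here are assumptions for illustration.

```python
import numpy as np

def influence_scores(X, k=10, seed=0):
    """Illustrative sensitivity-style outlier scores (assumed formula).

    Seeds k rough centers via D^2 (k-means++-style) sampling, then scores
    each point by its share of the clustering cost plus its share of the
    cost and mass of its own cluster. Large scores suggest outliers.
    """
    rng = np.random.default_rng(seed)
    n = len(X)

    # D^2-sampling to obtain a rough (bicriteria) set of centers.
    centers = [X[rng.integers(n)]]
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(k - 1):
        probs = (d2 + 1e-12) / (d2 + 1e-12).sum()
        centers.append(X[rng.choice(n, p=probs)])
        d2 = np.minimum(d2, np.sum((X - centers[-1]) ** 2, axis=1))

    # Assign each point to its closest center and record its cost.
    C = np.stack(centers)
    dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    cost = dists[np.arange(n), assign]
    total = cost.sum() + 1e-12

    # Sensitivity-style upper bound: point cost relative to total cost,
    # plus per-cluster terms that penalize points in small or expensive clusters.
    scores = cost / total
    for j in range(k):
        idx = np.flatnonzero(assign == j)
        if idx.size:
            scores[idx] += cost[idx].sum() / (idx.size * total) + 1.0 / idx.size
    return scores
```

Ranking points by these scores and reporting the largest ones as outlier candidates mirrors the linear-time claim: each point is touched O(k) times, for O(nkd) work overall.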
