Fast and Scalable Outlier Detection with Approximate Nearest Neighbor Ensembles

Popular outlier detection methods require the pairwise comparison of objects to compute the nearest neighbors. This inherently quadratic problem is not scalable to large data sets, making multidimensional outlier detection for big data still an open challenge. Existing approximate neighbor search methods are designed to preserve distances as well as possible. In this article, we present a highly scalable approach to compute the nearest neighbors of objects that instead focuses on preserving neighborhoods well using an ensemble of space-filling curves. We show that the method has near-linear complexity, can be distributed to clusters for computation, and preserves neighborhoods—but not distances—better than established methods such as locality sensitive hashing and projection indexed nearest neighbors. Furthermore, we demonstrate that, by preserving neighborhoods, the quality of outlier detection based on local density estimates is not only well retained but sometimes even improved, an effect that can be explained by relating our method to outlier detection ensembles. At the same time, the outlier detection process is accelerated by two orders of magnitude.
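The core idea of the abstract, finding neighbor candidates from an ensemble of space-filling curves rather than from all pairwise comparisons, can be illustrated with a minimal sketch. It is not the article's implementation, and everything specific in it is an assumption made for illustration: the Z-order (Morton) curve, the random shift per ensemble member, the window size, and the names `morton_key` and `approx_knn`. Each ensemble member sorts the points along one shifted curve and collects the points within a fixed window in that order as candidates. Exact distances are then computed only for the merged candidates.

```python
import random
from math import dist


def morton_key(cell, bits):
    """Interleave the bits of integer grid coordinates into one Z-order key."""
    key = 0
    for b in range(bits - 1, -1, -1):      # most significant bit first
        for c in cell:
            key = (key << 1) | ((c >> b) & 1)
    return key


def approx_knn(points, k, n_curves=8, window=None, bits=10, seed=0):
    """Approximate k nearest neighbors from an ensemble of shifted Z-order curves.

    Hypothetical helper: parameter names and defaults are illustrative only.
    Returns, for each point index, a list of up to k neighbor indices.
    """
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    window = window or 2 * k               # candidates taken on each side
    cells = 1 << bits                      # grid resolution per axis

    # Bounding box, used to normalise every coordinate to [0, 1].
    lo = [min(p[j] for p in points) for j in range(d)]
    hi = [max(p[j] for p in points) for j in range(d)]
    span = [max(hi[j] - lo[j], 1e-12) for j in range(d)]

    candidates = [set() for _ in range(n)]
    for _ in range(n_curves):
        # A fresh random shift gives each curve different cell boundaries.
        shift = [rng.random() for _ in range(d)]
        keyed = []
        for i, p in enumerate(points):
            cell = [
                min(int(((p[j] - lo[j]) / span[j] + shift[j]) / 2 * cells),
                    cells - 1)
                for j in range(d)
            ]
            keyed.append((morton_key(cell, bits), i))
        keyed.sort()                       # sorting is the dominant cost
        order = [i for _, i in keyed]

        # Points close together on the curve become neighbor candidates.
        for pos, i in enumerate(order):
            for q in order[max(0, pos - window):pos + window + 1]:
                if q != i:
                    candidates[i].add(q)

    # Refinement: exact distances, but only to the merged candidates.
    return [
        sorted(candidates[i], key=lambda q: dist(points[i], points[q]))[:k]
        for i in range(n)
    ]
```

With `window` much smaller than the data set size, the cost per curve is dominated by the sort, which is in line with the near-linear behavior the abstract describes. The resulting neighbor lists could then be passed to a local density-based outlier score such as LOF in place of exact neighbors.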
