Distributed Top-N local outlier detection in big data

Top-N local outlier detection, which reports the N points with the largest Local Outlier Factor (LOF) scores, has been shown to be very effective for identifying outliers in big datasets. However, detecting Top-N local outliers is computationally expensive, since computing LOF scores for all data points requires a huge number of high-complexity k-nearest neighbor (kNN) searches. In this work, we present DTOLF, the first distributed solution for Top-N local outlier detection. First, DTOLF features an innovative safe elimination strategy that efficiently identifies dually-safe points, namely those that are guaranteed to (1) not be classified as Top-N outliers and (2) not be needed as neighbors of points residing on other machines. By eliminating such points, it effectively minimizes both the processing and communication costs of the Top-N outlier detection process. Further, based on the well-accepted observation that strong correlations among attributes are prevalent in real-world datasets, we propose correlation-aware optimization strategies that ensure the effectiveness of grid-based partitioning and of the safe elimination strategy in multi-dimensional datasets. Our extensive experimental evaluation on the OpenStreetMap, SDSS, and TIGER datasets demonstrates the effectiveness of DTOLF: it is up to 10 times faster than the alternative methods and scales to terabyte-level datasets.
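
To make the Top-N LOF problem concrete, the following is a minimal single-machine sketch that uses the standard LOF definition with a brute-force kNN search. It is not DTOLF: it performs no partitioning and no safe elimination, and the function names (`knn`, `lof_scores`, `top_n_outliers`) are hypothetical, chosen only for illustration. It also assumes no point has more than k exact duplicates and breaks distance ties by index rather than keeping all tied neighbors.

```python
import math


def knn(points, i, k):
    """Brute-force kNN: the k nearest neighbors of points[i] as (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]


def lof_scores(points, k):
    """Compute the LOF score of every point (assumes fewer than k duplicates per point)."""
    neighbors = [knn(points, i, k) for i in range(len(points))]
    # k-distance of each point: distance to its k-th nearest neighbor.
    k_dist = [nbrs[-1][0] for nbrs in neighbors]

    # Local reachability density: inverse of the mean reachability distance
    # from a point to its neighbors, where reach-dist(p, o) = max(k-dist(o), d(p, o)).
    lrd = []
    for nbrs in neighbors:
        reach = [max(k_dist[j], d) for d, j in nbrs]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for _, j in nbrs) / (len(nbrs) * lrd[i])
        for i, nbrs in enumerate(neighbors)
    ]


def top_n_outliers(points, k, n):
    """Return the n (index, score) pairs with the largest LOF scores."""
    scores = lof_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:n]]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(top_n_outliers(data, k=2, n=1))
```

Each call to `knn` scans the whole dataset, so this baseline is quadratic in the number of points. That cost is what motivates distributing the work and discarding points that can neither be Top-N outliers nor serve as neighbors for points on other machines.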
