Paper information - Distributed Top-N local outlier detection in big data


The concept of Top-N local outliers, which focuses on detecting the N points with the largest Local Outlier Factor (LOF) scores, has been shown to be very effective for identifying outliers in big datasets. However, detecting Top-N local outliers is computationally expensive, since computing LOF scores for all data points requires a huge number of high-complexity k-nearest neighbor (kNN) searches. In this work, we thus present the first distributed solution to this problem of Top-N local outlier detection (DTOLF). First, DTOLF features an innovative safe elimination strategy that efficiently identifies dually-safe points, namely those that are guaranteed (1) not to be classified as Top-N outliers and (2) not to be needed as neighbors of points residing on other machines. It therefore effectively minimizes both the processing and communication costs of the Top-N outlier detection process. Further, based on the well-accepted observation that strong correlations among attributes are prevalent in real-world datasets, we propose correlation-aware optimization strategies that ensure the effectiveness of grid-based partitioning and of the safe elimination strategy in multi-dimensional datasets. Our extensive experimental evaluation on OpenStreetMap, SDSS, and TIGER datasets demonstrates the effectiveness of DTOLF: it is up to 10 times faster than the alternative methods and scales to terabyte-level datasets.
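To make the cost argument in the abstract concrete, the following is a minimal single-machine sketch of the Top-N LOF problem itself, using the standard LOF definition (k-distance, reachability distance, local reachability density). It is not DTOLF: it implements neither the safe elimination strategy nor the grid-based partitioning, and simply runs one brute-force kNN search per point, which is the expense the paper sets out to avoid. The function names `knn`, `lof_scores`, and `top_n_outliers` are hypothetical, and the sketch assumes that each neighborhood holds exactly k points (ties are not expanded) and that the data contains no groups of more than k identical points.

```python
import heapq
import math


def knn(points, i, k):
    """Brute-force kNN: the k nearest neighbors of points[i] as (distance, index)."""
    dists = [(math.dist(points[i], q), j) for j, q in enumerate(points) if j != i]
    return heapq.nsmallest(k, dists)


def lof_scores(points, k):
    """Compute the LOF score of every point (one kNN search per point)."""
    neighbors = [knn(points, i, k) for i in range(len(points))]
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_dist = [nbrs[-1][0] for nbrs in neighbors]

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(p, o) = max(k-distance(o), d(p, o)).
    lrd = []
    for nbrs in neighbors:
        reach_sum = sum(max(k_dist[j], d) for d, j in nbrs)
        lrd.append(len(nbrs) / reach_sum)

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for _, j in nbrs) / (len(nbrs) * lrd[i])
        for i, nbrs in enumerate(neighbors)
    ]


def top_n_outliers(points, k, n):
    """Indices of the n points with the largest LOF scores."""
    scores = lof_scores(points, k)
    return heapq.nlargest(n, range(len(points)), key=scores.__getitem__)


if __name__ == "__main__":
    # Four points forming a tight square plus one isolated point.
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(top_n_outliers(data, k=2, n=1))
```

Points inside a uniformly dense region score close to 1, while an isolated point scores well above 1, so only the highest-scoring points matter for a Top-N query. The sketch still computes a score for every point, which is the work that eliminating points early is meant to save.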

Lei Cao | Elke A. Rundensteiner | Yizhou Yan
