Biased-sampling of density-based local outlier detection algorithm

Anomaly detection is an active research area in machine learning and data mining. Current outlier mining approaches based on distance or nearest neighbors take too long to run on high-dimensional, massive data. Many improvements to these algorithms have been proposed, but they still do not meet the demands of growing data volumes, and detection remains ineffective. This paper therefore presents a density-based anomaly detection algorithm that uses biased sampling. First, to avoid complex kernel function estimation and integration, we divide the data set into grid cells and use the number of data points in each cell as an approximate density. To reduce the cost of computing over the divided cells, we use a hash table: each grid cell is mapped to a hash table entry while its data points are counted. We then roll up neighboring cells that have similar local density and compute the approximate density of the combined data clusters. Next, we apply probability-based biased sampling to the data to be examined, which yields a subset; we then use density-based local outlier detection to compute the abnormal factor of each object in the subset. Because the subset is obtained by biased sampling, the abnormal factor acts as both a local outlier factor and a global outlier factor; the higher a point's score, the higher its degree of outlierness. Experiments on various artificial and real-life data sets confirm that, compared with previous related methods, our method achieves better accuracy, better scalability, and more efficient computation.
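
The grid-count density and the biased sampling step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the sampling probability, so the rule used here (keep a point with probability `base_rate * density ** -exponent`, capped at 1, so that sparse cells are over-represented) is an assumption, as are the names `grid_cell`, `grid_counts`, `biased_sample`, `cell_width`, `base_rate`, and `exponent`. The roll-up of neighboring cells and the local outlier scoring of the subset are omitted.

```python
import random
from collections import Counter


def grid_cell(point, cell_width):
    """Map a point to the integer coordinates of the grid cell containing it."""
    return tuple(int(x // cell_width) for x in point)


def grid_counts(points, cell_width):
    """Count points per occupied cell.

    The Counter plays the role of the hash table: only occupied cells
    are stored, keyed by their integer cell coordinates.
    """
    return Counter(grid_cell(p, cell_width) for p in points)


def biased_sample(points, cell_width, base_rate, exponent, rng):
    """Keep each point with a probability that falls as its cell count grows.

    Assumed rule (not taken from the paper):
        keep_prob = min(1, base_rate * density ** -exponent)
    where density is the number of points in the point's grid cell.
    """
    counts = grid_counts(points, cell_width)
    sample = []
    for p in points:
        density = counts[grid_cell(p, cell_width)]
        keep_prob = min(1.0, base_rate * density ** (-exponent))
        if rng.random() < keep_prob:
            sample.append(p)
    return sample


if __name__ == "__main__":
    rng = random.Random(0)
    # 200 points packed into one cell, plus one isolated point.
    dense = [(0.001 * i, 0.001 * i) for i in range(200)]
    isolated = (5.5, 5.5)
    subset = biased_sample(dense + [isolated], cell_width=1.0,
                           base_rate=1.0, exponent=1.0, rng=rng)
    print(len(subset), isolated in subset)
```

With `base_rate=1.0`, a point alone in its cell is always kept, while each point in the 200-point cell is kept with probability 1/200. The subset is therefore much smaller than the input but still contains the candidates most likely to be outliers.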
