Detection of Local Outlier over Dynamic Data Streams Using Efficient Partitioning Method

Outlier detection is the process of identifying data objects that are grossly different from, or inconsistent with, the rest of the data. Important data mining applications include fraud detection, customer behavior analysis, and intrusion detection. A number of good algorithms exist for detecting outliers when the entire data set is available and the algorithm can make more than one pass over it. Among the existing methods, LOF (Local Outlier Factor), a density-based method, is very effective at detecting all forms of outliers. However, the LOF algorithm cannot be applied directly to data streams: the large number of nearest-neighbor searches, lrd (local reachability density) computations, and LOF computations makes it highly inefficient in that setting. In this paper we propose a cluster-based partitioning algorithm that divides the stream into safe regions and candidate regions. In the second phase, we apply the LOF algorithm to these partitions separately, with a slight enhancement to the LOF computation over the candidate regions, to accurately find the most outstanding outliers. Several experiments on different data sets confirm that our technique finds better outliers at lower computational cost than direct LOF or the other enhancements proposed for LOF.
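
To make the cost argument concrete, the following is a minimal sketch of the standard batch LOF computation, not the partitioning method proposed here. The function names (`k_nearest`, `lof_scores`) and the brute-force neighbor search are illustrative assumptions, and the sketch assumes the points are distinct so that no reachability sum is zero.

```python
import math

def k_nearest(points, i, k):
    """Return (k-distance, neighbor indices) of point i, keeping ties at the k-distance."""
    dists = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points)) if j != i
    )
    k_dist = dists[k - 1][0]
    neighbors = [j for d, j in dists if d <= k_dist]
    return k_dist, neighbors

def lof_scores(points, k):
    """Compute the LOF score of every point by brute force (quadratic in len(points))."""
    n = len(points)
    info = [k_nearest(points, i, k) for i in range(n)]

    def reach_dist(p, o):
        # Reachability distance of p with respect to o.
        return max(info[o][0], math.dist(points[p], points[o]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for p in range(n):
        neigh = info[p][1]
        lrd.append(len(neigh) / sum(reach_dist(p, o) for o in neigh))

    # LOF: mean ratio of each neighbor's lrd to the point's own lrd.
    return [
        sum(lrd[o] / lrd[p] for o in info[p][1]) / len(info[p][1])
        for p in range(n)
    ]

# Four points on a unit square plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
```

Scores close to 1 indicate points whose density matches that of their neighbors, while the isolated point receives a score well above 1. Every point needs a neighbor search, an lrd value, and a LOF value, and a newly arrived point can change these quantities for existing points, which is the overhead that makes direct use on a stream expensive.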
