Efficient Clustering-Based Outlier Detection Algorithm for Dynamic Data Stream

Anomaly detection is an important and active research problem in many fields, with numerous applications. Most existing methods are based on distance measures, but these are computationally inefficient for data streams. Moreover, because memory is limited relative to the size of the stream, most existing work on outlier detection in data streams declares a point an outlier or an inlier as soon as it arrives. Given the dynamic nature of the incoming data, such an immediate decision is often wrong. In this paper we introduce a clustering-based approach that divides the stream into chunks and clusters each chunk into a fixed number of clusters using k-means. Instead of keeping only summary information, as is common when clustering data streams, we retain the candidate outliers and the mean value of every cluster for a fixed number of subsequent stream chunks, to confirm that the detected candidates are real outliers. By using the cluster means of previous chunks together with those of the current chunk, we make a better-informed decision about the outlierness of data stream objects. Experiments on several datasets confirm that our technique finds better outliers at lower computational cost than existing distance-based approaches to outlier detection in data streams.
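
The chunk-wise procedure described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify how candidates are selected or confirmed, so the distance threshold `radius`, the number of follow-up chunks `patience`, the seeding of k-means with the first k points of each chunk, and all function names are assumptions made for illustration. A point is treated as a candidate when it lies farther than `radius` from every cluster mean of its own chunk; it is cleared if a later chunk produces a mean close to it, and confirmed once it has survived `patience` later chunks.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means on a list of tuples; returns the k cluster means.

    Assumption: the first k points of the chunk seed the centroids, which
    keeps the sketch deterministic. A real system would seed more carefully.
    """
    centroids = list(points[:k])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[j].append(p)
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def nearest_distance(p, centroids):
    """Distance from p to the closest cluster mean."""
    return min(math.dist(p, c) for c in centroids)


def detect_outliers(chunks, k=2, radius=6.0, patience=2):
    """Return (confirmed, pending) after processing the chunks in order.

    pending holds (point, age) pairs: candidates still under observation.
    radius and patience are hypothetical parameters, not taken from the paper.
    """
    confirmed, pending = [], []
    for chunk in chunks:
        means = kmeans(chunk, k)
        # Re-test earlier candidates against the current chunk's means.
        survivors = []
        for point, age in pending:
            if nearest_distance(point, means) <= radius:
                continue  # newer data explains the point: treat it as an inlier
            if age + 1 >= patience:
                confirmed.append(point)  # stayed isolated long enough
            else:
                survivors.append((point, age + 1))
        pending = survivors
        # Points far from every mean of their own chunk become new candidates.
        pending.extend((p, 0) for p in chunk if nearest_distance(p, means) > radius)
    return confirmed, pending


def grid(cx, cy):
    """Synthetic 5x5 cluster of 25 points centred on (cx, cy)."""
    return [(cx + dx, cy + dy) for dx in range(-2, 3) for dy in range(-2, 3)]


def interleave(a, b):
    """Alternate the points of two clusters, as an arriving stream might."""
    return [p for pair in zip(a, b) for p in pair]


# Chunk 1 contains an isolated point (40, 40) and an early point (40, 0) of a
# cluster that only becomes visible in chunk 2.
chunks = [
    interleave(grid(2, 2), grid(22, 22)) + [(40, 40), (40, 0)],
    interleave(grid(2, 2), grid(40, 0)),
    interleave(grid(2, 2), grid(22, 22)),
]
confirmed, pending = detect_outliers(chunks)
print(confirmed, pending)  # [(40, 40)] []
```

In this synthetic stream, a decision taken on the first chunk alone would flag both (40, 40) and (40, 0). Deferring the decision clears (40, 0) once the second chunk forms a cluster around it, while (40, 40) stays isolated and is confirmed after two further chunks. Only the candidates and the per-chunk means are retained between chunks, which is the memory saving the approach relies on.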
