A Multi Density-Based Clustering Algorithm for Data Stream with Noise

Density-based clustering algorithms can detect clusters of arbitrary shape, handle outliers, and do not need the number of clusters in advance. However, they do not work properly in multi-density environments. Existing multi-density clustering algorithms are hard to apply to data streams because they need the whole data set to perform clustering, make two passes over the data, and have high execution time. Data streams arrive continuously and must be processed in limited time and memory. Therefore, we need an algorithm that clusters data streams with different densities while also overcoming the challenges of clustering data streams. In this paper, we introduce a multi-density clustering algorithm for data streams called MuDi-Stream. MuDi-Stream is an online-offline clustering algorithm in which the online phase forms core-mini-clusters using a newly proposed core distance and the offline phase clusters the core-mini-clusters with a density-based method. The new core distance, called the mini-core distance, is calculated from the number of neighboring data points around the core. The algorithm therefore has different core distances for different clusters, which lets it cover multi-density environments. A novel pruning strategy is also used to separate the real data from the noise by mapping the outliers onto a grid. The grid structure keeps the neighbors of each data point, which is used to determine the mini-core distance and to remove noise effectively. Our performance study over synthetic data sets demonstrates the effectiveness of our method.
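The online-offline structure described above can be illustrated with a minimal Python sketch. It is not the MuDi-Stream implementation: the abstract does not give the formula for the mini-core distance, so the sketch assumes a stand-in (the mean distance of the seeding neighbors to their center), and the names `MiniCluster`, `OnlinePhase`, `offline_phase`, `cell_size` and `min_points` are hypothetical. It only shows the general idea that each mini-cluster carries its own radius, that unabsorbed points wait in a grid, and that an offline step links nearby mini-clusters.

```python
import math
from collections import defaultdict


class MiniCluster:
    """Summary of nearby points: count, per-dimension sum, and its own radius."""

    def __init__(self, points):
        self.n = len(points)
        self.linear_sum = [sum(col) for col in zip(*points)]
        center = self.center
        # ASSUMPTION: stand-in for the mini-core distance. The mean distance of
        # the seeding neighbors to their center, so denser regions get smaller radii.
        self.radius = max(sum(math.dist(p, center) for p in points) / self.n, 1e-9)

    @property
    def center(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point):
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]


class OnlinePhase:
    """Single-pass phase: absorb each point or park it in the outlier grid."""

    def __init__(self, cell_size=1.0, min_points=3):
        self.cell_size = cell_size    # hypothetical grid granularity
        self.min_points = min_points  # hypothetical promotion threshold
        self.clusters = []
        self.outlier_grid = defaultdict(list)

    def insert(self, point):
        nearest = min(self.clusters,
                      key=lambda c: math.dist(c.center, point), default=None)
        if nearest is not None and math.dist(nearest.center, point) <= nearest.radius:
            nearest.absorb(point)
            return
        # Not covered by any mini-cluster: map the point to its grid cell.
        cell = tuple(math.floor(x / self.cell_size) for x in point)
        self.outlier_grid[cell].append(point)
        # A cell with enough neighbors is promoted to a mini-cluster.
        if len(self.outlier_grid[cell]) >= self.min_points:
            self.clusters.append(MiniCluster(self.outlier_grid.pop(cell)))


def offline_phase(clusters):
    """Link mini-clusters whose radii overlap; return one label per mini-cluster."""
    parent = list(range(len(clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, a in enumerate(clusters):
        for j in range(i + 1, len(clusters)):
            b = clusters[j]
            if math.dist(a.center, b.center) <= a.radius + b.radius:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(clusters))]


# Toy stream: a dense group, a sparse group, one isolated noise point.
online = OnlinePhase(cell_size=1.0, min_points=3)
stream = [(0.10, 0.10), (0.15, 0.10), (0.10, 0.15),
          (10.1, 10.1), (10.7, 10.1), (10.1, 10.7),
          (50.5, 50.5), (0.12, 0.12)]
for p in stream:
    online.insert(p)
labels = offline_phase(online.clusters)
```

In this toy run the dense group and the sparse group each form one mini-cluster with a different radius, while the isolated point stays in the grid as a noise candidate. A real streaming algorithm would also need aging or pruning of stale grid cells, which this sketch leaves out.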
