A Novel High Dimensional and High Speed Data Streams Algorithm: HSDStream

This paper presents a novel high-speed clustering scheme for high-dimensional data streams. Data stream clustering has gained importance in applications such as network monitoring, intrusion detection, and real-time sensing. Clustering high-dimensional stream data is inherently difficult because the data evolve over time and span many dimensions. To tackle this problem, clustering is performed on projected subspaces of the high-dimensional space and on a limited window of data per unit of time. We propose a High Speed and Dimensions data stream clustering scheme (HSDStream), which employs exponential moving averages to reduce memory usage and speed up the processing of the projected subspace data stream. It works in three steps: i) initialization, ii) real-time maintenance of core and outlier micro-clusters, and iii) on-demand offline generation of the final clusters. The proposed algorithm is compared against high-dimensional density-based projected clustering (HDDStream) on cluster purity, memory usage, and cluster sensitivity. Experimental results are obtained on the corrected KDD intrusion detection dataset. These results show that HSDStream outperforms HDDStream on all performance metrics, most notably memory usage and processing speed.
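
To illustrate why exponential moving averages can keep memory small, the following is a minimal sketch of a micro-cluster summary that stores only a per-dimension running mean and variance, so no window of raw points has to be retained. It is not the HSDStream implementation: the class name `EmaMicroCluster`, the smoothing factor `alpha`, the variance threshold `max_var`, and the rule of treating low-variance dimensions as the projected subspace are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EmaMicroCluster:
    """Hypothetical micro-cluster summary kept as exponential moving averages."""

    alpha: float  # smoothing factor in (0, 1]; an assumed parameter, not from the paper
    mean: List[float] = field(default_factory=list)
    var: List[float] = field(default_factory=list)

    def update(self, point: List[float]) -> None:
        """Fold one stream point into the summary in O(d) time and O(d) memory."""
        if not self.mean:
            # First point: initialise the mean to the point and the variance to zero.
            self.mean = list(point)
            self.var = [0.0] * len(point)
            return
        for j, x in enumerate(point):
            delta = x - self.mean[j]
            # Standard exponentially weighted mean and variance recurrences.
            self.mean[j] += self.alpha * delta
            self.var[j] = (1.0 - self.alpha) * (self.var[j] + self.alpha * delta * delta)

    def preferred_dimensions(self, max_var: float) -> List[int]:
        """Indices of dimensions whose smoothed variance is at most max_var
        (an assumed criterion for picking a projected subspace)."""
        return [j for j, v in enumerate(self.var) if v <= max_var]
```

Under these assumptions, older points fade out of the summary geometrically as new points arrive, which is one way an evolving stream can be tracked without an explicit window buffer. A caller would invoke `update` once per arriving point and query `preferred_dimensions` when a projected subspace is needed.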
