Scaling up for high dimensional and high speed data streams: HSDStream

This paper presents a novel high-speed clustering scheme for high-dimensional data streams. Data stream clustering has gained importance in applications such as network monitoring, intrusion detection, and real-time sensing. Clustering high-dimensional stream data is inherently complex: the stream evolves over time, and the high dimensionality makes the problem non-trivial. To tackle this problem, clustering is performed on projected subspaces of the high-dimensional space, using a limited window of data per unit of time. We propose a High Speed and Dimensions data stream clustering scheme (HSDStream), which employs exponential moving averages to reduce memory usage and speed up the processing of the projected subspace data stream. The proposed algorithm has been tested against HDDStream for cluster purity, memory usage, and cluster sensitivity. Experimental results were obtained on the corrected KDD intrusion detection dataset. These results show that HSDStream outperforms HDDStream in all performance metrics, especially memory usage and processing speed.
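To illustrate why exponential moving averages can cut memory use, the following is a minimal sketch of a constant-memory, per-dimension summary of a stream. It is not the HSDStream algorithm itself: the class name `EmaSummary`, the smoothing factor `alpha`, and the variance threshold `max_var` used to pick low-variance ("projected") dimensions are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EmaSummary:
    """Hypothetical constant-memory summary of a stream of d-dimensional points.

    Stores one exponentially weighted mean and variance per dimension,
    so memory stays O(d) no matter how many points have been seen.
    """
    alpha: float  # assumed smoothing factor in (0, 1]; larger forgets faster
    mean: List[float] = field(default_factory=list)
    var: List[float] = field(default_factory=list)
    count: int = 0

    def update(self, point: List[float]) -> None:
        """Fold one new point into the summary without storing the point."""
        if self.count == 0:
            # The first point initialises the mean; variance starts at zero.
            self.mean = list(point)
            self.var = [0.0] * len(point)
        else:
            for i, x in enumerate(point):
                delta = x - self.mean[i]
                # Standard incremental exponentially weighted mean and variance.
                self.mean[i] += self.alpha * delta
                self.var[i] = (1.0 - self.alpha) * (
                    self.var[i] + self.alpha * delta * delta
                )
        self.count += 1

    def preferred_dimensions(self, max_var: float) -> List[int]:
        """Return indices of dimensions whose weighted variance is at most max_var.

        The threshold max_var is an illustrative assumption, not a value
        taken from the paper.
        """
        return [i for i, v in enumerate(self.var) if v <= max_var]


# Usage: dimension 0 is stable, dimension 1 fluctuates widely.
summary = EmaSummary(alpha=0.5)
for p in [[1.0, 0.0], [1.0, 10.0], [1.0, 0.0], [1.0, 10.0]]:
    summary.update(p)
low_variance_dims = summary.preferred_dimensions(max_var=1.0)
```

Because each update touches only the stored mean and variance, older points never need to be retained, and recent points dominate the summary as the stream evolves.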
