Density-based Projected Clustering over High Dimensional Data Streams

Clustering high-dimensional data streams is an important problem in many application domains, network monitoring being a prominent example. Several approaches have recently been proposed that address individual aspects of the problem independently: there are methods for clustering full-dimensional streams, and methods for finding clusters in subspaces of high-dimensional static data. So far, however, only a few approaches tackle the stream and high-dimensionality aspects simultaneously. In this work, we propose HDDStream, a new density-based projected clustering algorithm for high-dimensional data streams. Our algorithm summarizes both the data points and the dimensions in which these points are grouped together, and maintains these summaries online as new points arrive and old points expire due to ageing. Our experimental results illustrate the effectiveness and efficiency of HDDStream and also demonstrate that it could serve as a trigger for detecting drastic changes in the underlying stream population, such as bursts of network attacks.
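The idea of a summary that tracks both points and dimensions and is maintained under ageing can be illustrated with a minimal sketch. This is not HDDStream's actual data structure: the class name `DecayedSummary`, the parameters `decay_rate` and `var_threshold`, the exponential fading factor 2^(-decay_rate * elapsed time), and the rule that a dimension counts as "preferred" when its weighted variance is small are all assumptions made for illustration.

```python
class DecayedSummary:
    """Hypothetical summary of a group of stream points (illustrative only)."""

    def __init__(self, dims, decay_rate=0.1, var_threshold=0.5):
        self.decay_rate = decay_rate        # assumed ageing parameter
        self.var_threshold = var_threshold  # assumed cutoff for a preferred dimension
        self.weight = 0.0                   # decayed number of points
        self.linear_sum = [0.0] * dims      # decayed per-dimension sum of values
        self.square_sum = [0.0] * dims      # decayed per-dimension sum of squares
        self.last_update = 0.0

    def _age(self, now):
        # Fade all statistics by 2^(-decay_rate * elapsed time) (assumed ageing model).
        factor = 2.0 ** (-self.decay_rate * (now - self.last_update))
        self.weight *= factor
        self.linear_sum = [factor * v for v in self.linear_sum]
        self.square_sum = [factor * v for v in self.square_sum]
        self.last_update = now

    def add(self, point, now):
        # Age the old content first, then absorb the new point with weight 1.
        self._age(now)
        self.weight += 1.0
        for j, x in enumerate(point):
            self.linear_sum[j] += x
            self.square_sum[j] += x * x

    def variances(self):
        # Weighted variance per dimension; requires at least one added point.
        result = []
        for ls, ss in zip(self.linear_sum, self.square_sum):
            mean = ls / self.weight
            result.append(max(ss / self.weight - mean * mean, 0.0))
        return result

    def preferred_dimensions(self):
        # Dimensions along which the summarized points lie close together.
        return [j for j, v in enumerate(self.variances()) if v <= self.var_threshold]


summary = DecayedSummary(dims=2)
for t, point in enumerate([(1.0, 0.0), (1.1, 5.0), (0.9, 10.0)]):
    summary.add(point, now=float(t))
print(summary.preferred_dimensions())  # [0]
```

In this toy stream the points agree closely in dimension 0 and spread widely in dimension 1, so only dimension 0 is reported as preferred. Because every statistic is additive and fades by the same factor, the summary can be updated in constant time per point without storing the points themselves.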
