Discovering Clusters with Arbitrary Shapes and Densities in Data Streams

The growing availability of streaming data in many fields and forms increases the importance of streaming data analysis. The sheer volume of continuously flowing data poses a number of challenges for data stream analysis. Exploring the structure of streamed data is one major challenge, and it has led to the introduction of various clustering algorithms. However, current clustering algorithms still lack the ability to efficiently discover clusters of arbitrary densities in data streams. In this paper, a new grid-based and density-based algorithm is proposed for clustering streaming data. It addresses the drawbacks of recent algorithms in discovering clusters of arbitrary densities. The algorithm uses an online component to map the input data to grid cells. An offline component then clusters the grid cells based on their density information. Relative density relatedness measures and a dynamic range neighborhood are proposed to differentiate clusters of arbitrary densities. The experimental evaluation shows considerable improvements over state-of-the-art algorithms in both clustering quality and scalability. In addition, the output quality of the proposed algorithm is less sensitive to parameter selection errors.
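
The two-phase structure described in the abstract (an online component that maps points to grid cells, and an offline component that clusters cells by density) can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not define its relative density relatedness measures or its dynamic range neighborhood, so the sketch substitutes a simple count-ratio test between adjacent cells. The names `GridSummary`, `cluster_cells`, `min_count`, and `ratio` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque
from itertools import product


def cell_of(point, cell_size):
    """Map a point to the integer coordinates of the grid cell containing it."""
    return tuple(int(x // cell_size) for x in point)


class GridSummary:
    """Online component (sketch): keep only a per-cell point count."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.counts = defaultdict(int)

    def insert(self, point):
        # Each arriving point updates one counter; the point itself is not stored.
        self.counts[cell_of(point, self.cell_size)] += 1


def cluster_cells(counts, min_count=2, ratio=0.5):
    """Offline component (sketch): group adjacent cells of similar density.

    Assumption, not the paper's measure: two adjacent cells are related
    when the smaller count divided by the larger count is at least `ratio`.
    Cells with fewer than `min_count` points are treated as noise.
    """
    dense = {cell: n for cell, n in counts.items() if n >= min_count}
    labels = {}
    next_label = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:  # breadth-first expansion over adjacent cells
            cell = queue.popleft()
            for offset in product((-1, 0, 1), repeat=len(cell)):
                nb = tuple(c + o for c, o in zip(cell, offset))
                if nb == cell or nb not in dense or nb in labels:
                    continue
                lo, hi = sorted((dense[cell], dense[nb]))
                if lo / hi >= ratio:
                    labels[nb] = next_label
                    queue.append(nb)
        next_label += 1
    return labels


# Usage: feed a small stream, then cluster the resulting cell summary.
summary = GridSummary(cell_size=1.0)
for p in [(0.1, 0.1), (0.5, 0.5), (1.2, 0.3), (1.6, 0.8), (7.5, 7.5)]:
    summary.insert(p)
labels = cluster_cells(summary.counts)
```

Comparing neighboring cells by a ratio rather than a single global density threshold is what lets a sparse cluster and a dense cluster both survive in the same run, which is the general motivation the abstract gives for relative measures.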
