StreamXM: An Adaptive Partitional Clustering Solution for Evolving Data Streams

A challenge imposed by continuously arriving data streams is to analyze them and to modify the models that explain them as new data arrives. We propose StreamXM, a stream clustering technique that does not require an arbitrary selection of number of clusters, repeated and expensive heuristics or in-depth prior knowledge of the data to create an informed clustering that relates to the data. It allows a clustering that can adapt its number of classes to those present in the underlying distribution. In this paper, we propose two different variants of StreamXM and compare them against a current, state-of-the-art technique, StreamKM. We evaluate our proposed techniques using both synthetic and real world datasets. From our results, we show StreamXM and StreamKM run in similar time and with similar accuracy when running with similar numbers of clusters. We show our algorithms can provide superior stream clustering if true clusters are not known or if emerging or disappearing concepts will exist within the data stream.

[1]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .

[2]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[3]  Chang-Dong Wang,et al.  SVStream: A Support Vector-Based Algorithm for Clustering Data Streams , 2013, IEEE Transactions on Knowledge and Data Engineering.

[4]  Andrew W. Moore,et al.  X-means: Extending K-means with Efficient Estimation of the Number of Clusters , 2000, ICML.

[5]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[6]  Aoying Zhou,et al.  Density-Based Clustering over an Evolving Data Stream with Noise , 2006, SDM.

[7]  Sudipto Guha,et al.  Clustering Data Streams: Theory and Practice , 2003, IEEE Trans. Knowl. Data Eng..

[8]  Christian Sohler,et al.  StreamKM++: A clustering algorithm for data streams , 2010, JEAL.

[9]  Andrew W. Moore,et al.  Accelerating exact k-means algorithms with geometric reasoning , 1999, KDD '99.

[10]  Philip S. Yu,et al.  A Framework for Clustering Evolving Data Streams , 2003, VLDB.

[11]  Christopher Leckie,et al.  Unsupervised Anomaly Detection in Network Intrusion Detection Using Clusters , 2005, ACSC.

[12]  Mohamed Medhat Gaber,et al.  Data Stream Mining , 2010, Data Mining and Knowledge Discovery Handbook.

[13]  Wei Hu,et al.  Twitter spammer detection using data stream clustering , 2014, Inf. Sci..

[14]  Vijay Hanagandi,et al.  Density-based clustering and radial basis function modeling to generate credit card fraud scores , 1996, IEEE/IAFE 1996 Conference on Computational Intelligence for Financial Engineering (CIFEr).