StrDip: A Fast Data Stream Clustering Algorithm Using the Dip Test of Unimodality

Data stream clustering is an important problem in data mining. Because a data stream can grow without bound, the sheer volume of data makes storing it impractical. A number of algorithms have been proposed for data stream clustering, such as CluStream, DenStream, DStream and StrAP. With the arrival of the Big Data era, the amount of data arriving per timestamp is growing rapidly, so the time efficiency of data stream clustering algorithms is drawing considerable attention from researchers; some state-of-the-art algorithms achieve excellent cluster purity but are unacceptably slow. In this paper, we propose StrDip, a fast data stream clustering algorithm that combines the Dip Test of Unimodality with the online/offline two-stage stream clustering framework. StrDip also adopts a novel clustering feature vector and several microcluster pruning methods. Experiments on synthetic and real-world datasets show that, compared with other algorithms, StrDip gains a large advantage in time efficiency while maintaining good clustering purity and quality.
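To make the online stage of the two-stage framework concrete, the following is a minimal Python sketch of a generic, CluStream-style microcluster summary. It is not StrDip's novel clustering feature vector, its pruning rules, or its use of the dip test, none of which are specified in this abstract. The names `MicroCluster` and `online_step` and the thresholds `max_dist` and `max_age` are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Additive summary (n, linear sum, squared sum) of the absorbed points."""
    n: int
    ls: list          # per-dimension linear sum
    ss: list          # per-dimension sum of squares
    last_update: int  # timestamp of the most recent absorbed point

    @classmethod
    def from_point(cls, x, t):
        return cls(1, list(x), [v * v for v in x], t)

    def insert(self, x, t):
        # The summary is additive, so absorbing a point is O(dimensions).
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v
        self.last_update = t

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # Root-mean-square deviation of the absorbed points from the centroid.
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))


def online_step(clusters, x, t, max_dist=1.0, max_age=100):
    """Process one arriving point; thresholds are illustrative assumptions."""
    # Assumed pruning rule: drop microclusters not updated for max_age ticks.
    clusters[:] = [c for c in clusters if t - c.last_update <= max_age]
    # Absorb the point into the nearest microcluster, or start a new one.
    best = min(clusters, key=lambda c: math.dist(c.centroid(), x), default=None)
    if best is not None and math.dist(best.centroid(), x) <= max_dist:
        best.insert(x, t)
    else:
        clusters.append(MicroCluster.from_point(x, t))
    return clusters
```

In a framework of this kind, an offline stage would periodically cluster the microcluster centroids instead of the raw points, which is where the storage and time savings come from. That stage is omitted here.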
