Clustering Stream Data by Exploring the Evolution of Density Mountain

Stream clustering is a fundamental problem in many streaming data analysis applications. Compared with classical batch-mode clustering, stream clustering poses two key challenges: (i) given that the input data change continuously, how can clustering results be updated incrementally and efficiently? (ii) Given that clusters evolve continuously as the data evolve, how can cluster evolution activities be captured? Unfortunately, most existing stream clustering algorithms can neither update the clustering result in real time nor track the evolution of clusters. In this paper, we propose a stream clustering algorithm, EDMStream, that works by exploring the Evolution of Density Mountain. The density mountain is used to abstract the data distribution, and changes in it indicate that the data distribution is evolving. We track the evolution of clusters by monitoring the changes of density mountains. We further provide efficient data structures and filtering schemes that keep the density mountains updated in real time, which makes online clustering possible. Experimental results on synthetic and real datasets show that, compared with state-of-the-art stream clustering algorithms such as D-Stream, DenStream, DBSTREAM and MR-Stream, our algorithm responds to a cluster update much faster (about 7-15x faster than the best of the competitors) while achieving comparable cluster quality. Furthermore, EDMStream can successfully capture cluster evolution activities.
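
The abstract does not define a density mountain precisely. As a rough batch-mode illustration only, the sketch below assumes it resembles a density-peaks grouping: every point is linked to its nearest neighbor of higher density, and a point with no such neighbor within a distance limit becomes the summit of its own cluster. The function name `cluster_by_density_links` and the parameters `radius` and `link_limit` are hypothetical, and the sketch omits the incremental updates, data structures and filtering schemes that the paper describes.

```python
import math


def cluster_by_density_links(points, radius, link_limit):
    """Batch sketch (an assumption, not the paper's algorithm).

    Each point links to its nearest denser point; a point whose nearest
    denser point is farther than `link_limit` starts a new cluster.
    """
    n = len(points)

    # Local density: number of other points within `radius`.
    rho = [
        sum(
            1
            for j in range(n)
            if j != i and math.dist(points[i], points[j]) <= radius
        )
        for i in range(n)
    ]

    # Visit points from densest to sparsest; ties are broken by index, so
    # "denser" below means "earlier in this ordering".
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    labels = [None] * n
    next_label = 0
    for rank, i in enumerate(order):
        # Find the nearest point among those already visited (the denser ones).
        best, best_d = None, math.inf
        for j in order[:rank]:
            d = math.dist(points[i], points[j])
            if d < best_d:
                best, best_d = j, d

        if best is not None and best_d <= link_limit:
            # Attach to the same cluster as the nearest denser point.
            labels[i] = labels[best]
        else:
            # No denser point nearby: this point is a summit.
            labels[i] = next_label
            next_label += 1
    return labels


# Two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = cluster_by_density_links(points, radius=1.5, link_limit=3.0)
print(labels)
```

In a streaming setting the densities would change as points arrive and age, so the links and summits would have to be maintained incrementally; this sketch recomputes everything from scratch and is meant only to make the idea of linking points toward denser regions concrete.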
