Data Stream Clustering With Affinity Propagation

Data stream clustering provides insights into the underlying patterns of data flows. This paper focuses on selecting the best representatives from clusters of streaming data. Two main challenges arise: how to cluster the data around the best representatives, and how to handle the evolving patterns that characterize streaming data with dynamic distributions. For the first challenge, we employ the Affinity Propagation (AP) algorithm presented in 2007 by Frey and Dueck, as it offers good guarantees of clustering optimality for selecting exemplars. The second challenge is addressed by change detection. The presented StrAP algorithm combines AP with a statistical change point detection test; the clustering model is rebuilt whenever the test detects a change in the underlying data distribution. In addition to validation on two benchmark data sets, the presented algorithm is validated on a real-world application: monitoring the data flow of jobs submitted to the EGEE grid.
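
The control flow described above, a change test that gates a model rebuild, can be illustrated with a minimal sketch. The abstract does not say which statistical test is used, so the sketch assumes a Page-Hinkley test on a scalar signal (for instance, the distance from each arriving item to its nearest exemplar). The names `PageHinkley`, `monitor`, and `rebuild`, and the values of `delta` and `threshold`, are hypothetical choices made for illustration; the AP rebuild itself is left as a caller-supplied callback.

```python
from dataclasses import dataclass


@dataclass
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a scalar stream.

    Assumption: this stands in for the unspecified change point test;
    delta and threshold are illustrative values, not tuned settings.
    """
    delta: float = 0.05      # tolerated magnitude of drift
    threshold: float = 5.0   # detection threshold (often written lambda)
    n: int = 0               # number of observations seen
    mean: float = 0.0        # running mean of the observations
    cum: float = 0.0         # cumulative deviation m_t
    cum_min: float = 0.0     # running minimum M_t of m_t

    def update(self, x: float) -> bool:
        """Feed one observation; return True when a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold

    def reset(self) -> None:
        """Forget all statistics, e.g. after the model has been rebuilt."""
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0
        self.cum_min = 0.0


def monitor(stream, rebuild, detector):
    """Scan a stream of scalars and call rebuild(t) at each detected change.

    `rebuild` is a hypothetical hook where a clustering model (such as AP)
    would be recomputed. Returns the list of time steps at which it fired.
    """
    changes = []
    for t, x in enumerate(stream):
        if detector.update(x):
            changes.append(t)
            rebuild(t)
            detector.reset()
    return changes


# Usage: a synthetic signal whose mean jumps from 0 to 1 at t = 100.
signal = [0.0] * 100 + [1.0] * 100
rebuilt_at = []
detected = monitor(signal, rebuilt_at.append, PageHinkley())
```

On this synthetic signal the detector fires shortly after the jump and stays silent afterwards, so the rebuild hook runs once. Resetting the detector after each rebuild is a design choice of the sketch: the new model defines a new baseline, so statistics gathered under the old distribution are discarded.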

[1] Kok-Leong Ong, et al. Online mining of frequent sets in data streams with error guarantee, 2008, Knowledge and Information Systems.

[2] Hans-Peter Kriegel, et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996, KDD.

[3] E. S. Page. Continuous Inspection Schemes, 1954.

[4] Christos Faloutsos, et al. Adaptive, Hands-Off Stream Mining, 2003, VLDB.

[5] Graham Cormode, et al. Conquering the Divide: Continuous Clustering of Distributed Data Streams, 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[6] Philip S. Yu, et al. A Framework for Clustering Evolving Data Streams, 2003, VLDB.

[7] Michele Leone, et al. Clustering by Soft-constraint Affinity Propagation: Applications to Gene-expression Data, 2022.

[8] Delbert Dueck, et al. Clustering by Passing Messages Between Data Points, 2007, Science.

[9] Sheng Ma, et al. Adaptive diagnosis in distributed systems, 2005, IEEE Transactions on Neural Networks.

[10] Divesh Srivastava, et al. Finding hierarchical heavy hitters in streaming data, 2008, TKDD.

[11] Tian Zhang, et al. BIRCH: an efficient data clustering method for very large databases, 1996, SIGMOD '96.

[12] Aoying Zhou, et al. Density-Based Clustering over an Evolving Data Stream with Noise, 2006, SDM.

[13] Adam Meyerson, et al. Fast and Accurate k-means For Large Datasets, 2011, NIPS.

[14] Sudipto Guha, et al. Clustering Data Streams: Theory and Practice, 2003, IEEE Trans. Knowl. Data Eng.

[15] Ming-Syan Chen, et al. Adaptive Clustering for Multiple Evolving Streams, 2006, IEEE Transactions on Knowledge and Data Engineering.

[16] K Raghuveer, et al. Performance evaluation of data clustering techniques using KDD Cup-99 Intrusion detection data set, 2012.

[17] Salvatore J. Stolfo, et al. A data mining framework for building intrusion detection models, 1999, Proceedings of the 1999 IEEE Symposium on Security and Privacy (Cat. No.99CB36344).

[18] Jeffrey O. Kephart, et al. The Vision of Autonomic Computing, 2003, Computer.

[19] Michèle Sebag, et al. Data Streaming with Affinity Propagation, 2008, ECML/PKDD.

[20] Bhavani M. Thuraisingham, et al. Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints, 2011, IEEE Transactions on Knowledge and Data Engineering.

[21] Michèle Sebag, et al. Toward autonomic grids: analyzing the job flow with affinity streaming, 2009, KDD.

[22] D. Hinkley. Inference about the change-point from cumulative sum tests, 1971.

[23] Nicole Immorlica, et al. Locality-sensitive hashing scheme based on p-stable distributions, 2004, SCG '04.

[24] Sudipto Guha, et al. Clustering Data Streams, 2000, FOCS.

[25] Zaïd Harchaoui, et al. Kernel Change-point Analysis, 2008, NIPS.

[26] Philip S. Yu, et al. Active Mining of Data Streams, 2004, SDM.

[27] Li Tu, et al. Density-based clustering for real-time stream data, 2007, KDD '07.

[28] João Gama, et al. Accurate decision trees for mining high-speed data streams, 2003, KDD '03.

[29] Christopher Olston, et al. Distributed top-k monitoring, 2003, SIGMOD '03.

[30] S. Muthukrishnan, et al. Data streams: algorithms and applications, 2005, SODA '03.

[31] Silvia Nittel, et al. Scaling clustering algorithms for massive data sets using data streams, 2004, Proceedings. 20th International Conference on Data Engineering.

[32] Paul S. Bradley, et al. Scaling Clustering Algorithms to Large Databases, 1998, KDD.

[33] Mohamed Medhat Gaber, et al. Learning from Data Streams: Processing Techniques in Sensor Networks, 2007.

[34] Lawrence K. Saul, et al. Identifying suspicious URLs: an application of large-scale online learning, 2009, ICML '09.

[35] Charu C. Aggarwal, et al. Data Streams - Models and Algorithms, 2014, Advances in Database Systems.

[36] Xiangliang Zhang, et al. Processing of massive audit data streams for real-time anomaly intrusion detection, 2008, Comput. Commun.

[37] Xiangliang Zhang, et al. K-AP: Generating Specified K Clusters by Efficient Affinity Propagation, 2010, 2010 IEEE International Conference on Data Mining.

[38] Philip S. Yu, et al. Online Mining of Changes from Data Streams: Research Problems and Preliminary Results, 2003.

[39] Gurmeet Singh Manku, et al. Approximate counts and quantiles over sliding windows, 2004, PODS.

[40] Christian Sohler, et al. StreamKM++: A clustering algorithm for data streams, 2010, JEAL.