Data Stream Clustering With Affinity Propagation

Data stream clustering provides insights into the underlying patterns of data flows. This paper focuses on selecting the best representatives from clusters of streaming data. There are two main challenges: how to cluster with the best representatives and how to handle the evolving patterns that are important characteristics of streaming data with dynamic distributions. We employ the Affinity Propagation (AP) algorithm presented in 2007 by Frey and Dueck for the first challenge, as it offers good guarantees of clustering optimality for selecting exemplars. The second challenging problem is solved by change detection. The presented StrAP algorithm combines AP with a statistical change point detection test; the clustering model is rebuilt whenever the test detects a change in the underlying data distribution. Besides the validation on two benchmark data sets, the presented algorithm is validated on a real-world application, monitoring the data flow of jobs submitted to the EGEE grid.

[1]  Charu C. Aggarwal,et al.  Data Streams - Models and Algorithms , 2014, Advances in Database Systems.

[2]  Adam Meyerson,et al.  Fast and Accurate k-means For Large Datasets , 2011, NIPS.

[3]  Bhavani M. Thuraisingham,et al.  Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints , 2011, IEEE Transactions on Knowledge and Data Engineering.

[4]  Xiangliang Zhang,et al.  K-AP: Generating Specified K Clusters by Efficient Affinity Propagation , 2010, 2010 IEEE International Conference on Data Mining.

[5]  Christian Sohler,et al.  StreamKM++: A clustering algorithm for data streams , 2010, JEAL.

[6]  Michèle Sebag,et al.  Toward autonomic grids: analyzing the job flow with affinity streaming , 2009, KDD.

[7]  Lawrence K. Saul,et al.  Identifying suspicious URLs: an application of large-scale online learning , 2009, ICML '09.

[8]  Zaïd Harchaoui,et al.  Kernel Change-point Analysis , 2008, NIPS.

[9]  Kok-Leong Ong,et al.  Online mining of frequent sets in data streams with error guarantee , 2008, Knowledge and Information Systems.

[10]  João Gama,et al.  Learning from Data Streams: Processing Techniques in Sensor Networks , 2007 .

[11]  Li Tu,et al.  Density-based clustering for real-time stream data , 2007, KDD '07.

[12]  Sumedha,et al.  Clustering by soft-constraint affinity propagation: applications to gene-expression data , 2007, Bioinform..

[13]  Graham Cormode,et al.  Conquering the Divide: Continuous Clustering of Distributed Data Streams , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[14]  Delbert Dueck,et al.  Clustering by Passing Messages Between Data Points , 2007, Science.

[15]  Ming-Syan Chen,et al.  Adaptive Clustering for Multiple Evolving Streams , 2006, IEEE Transactions on Knowledge and Data Engineering.

[16]  Sheng Ma,et al.  Adaptive diagnosis in distributed systems , 2005, IEEE Transactions on Neural Networks.

[17]  S. Muthukrishnan,et al.  Data streams: algorithms and applications , 2005, SODA '03.

[18]  Philip S. Yu,et al.  Active Mining of Data Streams , 2004, SDM.

[19]  Gurmeet Singh Manku,et al.  Approximate counts and quantiles over sliding windows , 2004, PODS.

[20]  Silvia Nittel,et al.  Scaling clustering algorithms for massive data sets using data streams , 2004, Proceedings. 20th International Conference on Data Engineering.

[21]  Christos Faloutsos,et al.  Adaptive, Hands-Off Stream Mining , 2003, VLDB.

[22]  Philip S. Yu,et al.  A Framework for Clustering Evolving Data Streams , 2003, VLDB.

[23]  João Gama,et al.  Accurate decision trees for mining high-speed data streams , 2003, KDD '03.

[24]  Christopher Olston,et al.  Distributed top-k monitoring , 2003, SIGMOD '03.

[25]  Sudipto Guha,et al.  Clustering Data Streams: Theory and Practice , 2003, IEEE Trans. Knowl. Data Eng..

[26]  Sudipto Guha,et al.  Clustering data streams , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[27]  Salvatore J. Stolfo,et al.  A data mining framework for building intrusion detection models , 1999, Proceedings of the 1999 IEEE Symposium on Security and Privacy (Cat. No.99CB36344).

[28]  Paul S. Bradley,et al.  Scaling Clustering Algorithms to Large Databases , 1998, KDD.

[29]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[30]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[31]  D. Hinkley Inference about the change-point from cumulative sum tests , 1971 .

[32]  E. S. Page CONTINUOUS INSPECTION SCHEMES , 1954 .

[33]  M. Sebag,et al.  Data Streaming with Affinity Propagation , 2008 .

[34]  Divesh Srivastava,et al.  Finding hierarchical heavy hitters in streaming data , 2008, TKDD.

[35]  Xiangliang Zhang,et al.  Processing of massive audit data streams for real-time anomaly intrusion detection , 2008, Comput. Commun..

[36]  Aoying Zhou,et al.  Density-Based Clustering over an Evolving Data Stream with Noise , 2006, SDM.

[37]  Philip S. Yu,et al.  Online Mining of Changes from Data Streams: Research Problems and Preliminary Results , 2003 .

[38]  Jeffrey O. Kephart,et al.  The Vision of Autonomic Computing , 2003, Computer.