Data Streaming with Affinity Propagation

This paper proposes StrAP (Streaming AP), which extends Affinity Propagation (AP) to data streaming. AP, a new clustering algorithm, uses message passing to extract the data items, called exemplars, that best represent the dataset. StrAP is built in several steps. The first (Weighted AP) extends AP to weighted items with no loss of generality. The second (Hierarchical WAP) reduces the quadratic complexity of AP by applying AP to data subsets and then applying Weighted AP to the exemplars extracted from all subsets. Finally, StrAP extends Hierarchical WAP to handle changes in the data distribution. Experiments on artificial datasets, on the Intrusion Detection benchmark (KDD99), and on a real-world problem, clustering the stream of jobs submitted to the EGEE grid system, provide a comparative validation of the approach.
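The two-level scheme described above (AP on subsets, then Weighted AP on the collected exemplars) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: the negative squared Euclidean distance as similarity, the `preference` value, the interleaved split into subsets, the fallback used when no exemplar emerges, and the weighting rule (off-diagonal similarities of an item multiplied by the number of points it represents). The function names `affinity_propagation`, `similarity_matrix`, and `hierarchical_wap` are hypothetical.

```python
import random


def affinity_propagation(S, damping=0.5, iters=200):
    """Plain AP on a dense similarity matrix S (S[k][k] is the preference).
    Returns, for each item, the index of the exemplar it is assigned to."""
    n = len(S)
    if n == 1:
        return [0]
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iters):
        for i in range(n):
            vals = [A[i][k] + S[i][k] for k in range(n)]
            first = max(range(n), key=vals.__getitem__)
            max1 = vals[first]
            max2 = max(v for k, v in enumerate(vals) if k != first)
            for k in range(n):
                new = S[i][k] - (max2 if k == first else max1)
                R[i][k] = damping * R[i][k] + (1 - damping) * new
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]  # sum over i != k of max(0, r(i, k))
            for i in range(n):
                if i == k:
                    new = total
                else:
                    new = min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        # Assumed fallback: take the single most central item as exemplar.
        exemplars = [max(range(n),
                         key=lambda k: sum(S[i][k] for i in range(n) if i != k))]
    return [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
            for i in range(n)]


def similarity_matrix(points, weights, preference):
    """Negative squared distance, with each row scaled by the item's weight
    (assumed weighting rule); the diagonal holds the preference."""
    n = len(points)
    S = [[-weights[i] * sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
          for j in range(n)] for i in range(n)]
    for i in range(n):
        S[i][i] = preference
    return S


def hierarchical_wap(points, n_subsets, preference):
    """AP on each subset, then weighted AP on the collected exemplars.
    Returns the final exemplars and the number of points each represents."""
    exemplars, counts = [], []
    for s in range(n_subsets):
        subset = points[s::n_subsets]  # assumed interleaved split
        if not subset:
            continue
        labels = affinity_propagation(
            similarity_matrix(subset, [1] * len(subset), preference))
        for k in sorted(set(labels)):
            exemplars.append(subset[k])
            counts.append(labels.count(k))
    labels = affinity_propagation(similarity_matrix(exemplars, counts, preference))
    final = sorted(set(labels))
    weights = [sum(c for c, l in zip(counts, labels) if l == k) for k in final]
    return [exemplars[k] for k in final], weights


# Toy data: two noisy groups of 2-D points (synthetic, for illustration only).
random.seed(0)
points = [(random.gauss(c, 0.5), random.gauss(c, 0.5))
          for c in (0.0, 10.0) for _ in range(20)]
final_exemplars, final_weights = hierarchical_wap(points, n_subsets=4,
                                                  preference=-50.0)
for exemplar, weight in zip(final_exemplars, final_weights):
    print(exemplar, weight)
```

Each subset costs time quadratic in the subset size rather than in the full dataset size, which is the motivation the text gives for the hierarchical step. The weights carried into the second stage keep track of how many original points each exemplar stands for, so they always sum to the number of input points.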
