Clustering Evolving Data Stream with Affinity Propagation Algorithm

Clustering data streams is an active research area that has recently emerged to discover knowledge from large amounts of continuously generated data. Several data stream clustering algorithms have been proposed to perform unsupervised learning. Nevertheless, data stream clustering poses several challenges: handling dynamic data that arrive in an online fashion, processing data objects quickly and incrementally, and respecting time and memory limitations. In this paper, we propose a semi-supervised clustering algorithm that extends Affinity Propagation (AP) to handle evolving data streams. We combine a set of labeled data items with a set of exemplars to detect a change in the generative process underlying the data stream, which requires the stream model to be updated as soon as possible. Experimental comparisons with state-of-the-art data stream clustering methods demonstrate the effectiveness and efficiency of the proposed method.
