A Semi-supervised Incremental Clustering Algorithm for Streaming Data

Nowadays many applications need to deal with evolving data streams . In this work, we propose an incremental clustering approach for the exploitation of user constraints on data streams. Conventional constraints do not make sense on streaming data, so we extend the classic notion of constraint set into a constraint stream . We propose methods for using the constraint stream as data items are forgotten or new items arrive. Also we present an on-line clustering approach for the cost-based enforcement of the constraints during cluster adaptation on evolving data streams. Our method introduces the concept of multi-clusters (m-clusters) to capture arbitrarily shaped clusters. An m-cluster consists of multiple dense overlapping regions, named s-clusters, each of which can be efficiently represented by a single point. Also it proposes the definition of outliers clusters in order to handle outliers while it provides methods to observe changes in structure of clusters as data evolves.

[1]  Michèle Sebag,et al.  Data Streaming with Affinity Propagation , 2008, ECML/PKDD.

[2]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[3]  Myra Spiliopoulou,et al.  C-DenStream: Using Domain Knowledge on a Data Stream , 2009, Discovery Science.

[4]  Arindam Banerjee,et al.  Semi-supervised Clustering by Seeding , 2002, ICML.

[5]  Dimitrios Gunopulos,et al.  A framework for semi-supervised learning based on subjective and objective clustering criteria , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[6]  Johannes Fürnkranz,et al.  Knowledge Discovery in Databases: PKDD 2006, 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, September 18-22, 2006, Proceedings , 2006, PKDD.

[7]  Aoying Zhou,et al.  Density-Based Clustering over an Evolving Data Stream with Noise , 2006, SDM.

[8]  S. S. Ravi,et al.  Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results , 2005, PKDD.

[9]  Myra Spiliopoulou,et al.  Constraint-Based Query Clustering , 2007, AWIC.

[10]  Luís Torgo,et al.  Knowledge Discovery in Databases: PKDD 2005, 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, October 3-7, 2005, Proceedings , 2005, PKDD.

[11]  Philip S. Yu,et al.  A Framework for Clustering Evolving Data Streams , 2003, VLDB.

[12]  Philip S. Yu,et al.  A Framework for Projected Clustering of High Dimensional Data Streams , 2004, VLDB.

[13]  Raymond J. Mooney,et al.  A probabilistic framework for semi-supervised clustering , 2004, KDD.

[14]  S. S. Ravi,et al.  Clustering with Constraints: Feasibility Issues and the k-Means Algorithm , 2005, SDM.

[15]  Peter A. Flach,et al.  Evaluation Measures for Multi-class Subgroup Discovery , 2009, ECML/PKDD.

[16]  Claire Cardie,et al.  Constrained K-means Clustering with Background Knowledge , 2001, ICML.

[17]  Raymond J. Mooney,et al.  Integrating constraints and metric learning in semi-supervised clustering , 2004, ICML.

[18]  Ian Davidson,et al.  Measuring Constraint-Set Utility for Partitional Clustering Algorithms , 2006, PKDD.