C-DBSCAN: Density-Based Clustering with Constraints

Density-based clustering methods are of particular interest for applications in which the groups of data instances are expected to differ in size or shape, arbitrary shapes are possible, and the number of clusters is not known a priori. In such applications, background knowledge about the group membership or non-membership of some instances may be available, and exploiting it is therefore of interest. Recently, such knowledge has been expressed as constraints and exploited in constraint-based clustering. In this paper, we enhance the density-based algorithm DBSCAN with two kinds of constraints on data instances: "Must-Link" and "Cannot-Link" constraints. We test the new algorithm, C-DBSCAN, on artificial and real datasets and show that it outperforms DBSCAN even when only a small number of constraints is available.
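To make the two constraint types concrete, the following minimal sketch checks a given cluster labeling against a set of Must-Link and Cannot-Link pairs. It is an illustration only, not the C-DBSCAN algorithm itself. The function name `violated_constraints`, the dictionary-based label representation, the use of `-1` as a noise label, and the decision to count a Must-Link pair as violated when either instance is noise are all assumptions made for this example.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Pair = Tuple[Hashable, Hashable]


def violated_constraints(
    labels: Dict[Hashable, int],
    must_link: Iterable[Pair],
    cannot_link: Iterable[Pair],
    noise: int = -1,  # assumed noise label, not taken from the paper
) -> List[Tuple[str, Hashable, Hashable]]:
    """Return every instance-level constraint that the labeling violates."""
    violations = []
    for a, b in must_link:
        # Must-Link: both instances should share one (non-noise) cluster.
        if labels[a] != labels[b] or labels[a] == noise:
            violations.append(("must-link", a, b))
    for a, b in cannot_link:
        # Cannot-Link: the instances should not share a cluster.
        # Two noise points are not treated as being in the same cluster.
        if labels[a] == labels[b] and labels[a] != noise:
            violations.append(("cannot-link", a, b))
    return violations


# Hypothetical labeling of four instances; p4 is noise.
labels = {"p1": 0, "p2": 0, "p3": 1, "p4": -1}
must_link = [("p1", "p2"), ("p1", "p3")]
cannot_link = [("p2", "p3"), ("p1", "p2")]
result = violated_constraints(labels, must_link, cannot_link)
```

A check of this kind can be applied to the output of any clustering algorithm, constrained or not, to count how many of the supplied constraints a result fails to honor.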
