Constrained data clustering by depth control and progressive constraint relaxation

In order to import the domain knowledge or application-dependent parameters into the data mining systems, constraint-based mining has attracted a lot of research attention recently. In this paper, the attributes employed to model the constraints are called constraint attributes and those attributes involved in the objective function to be optimized are called optimization attributes. The constrained clustering considered in this paper is conducted in such a way that the objective function of optimization attributes is optimized subject to the condition that the imposed constraint is satisfied. Explicitly, we address the problem of constrained clustering with numerical constraints, in which the constraint attribute values of any two data items in the same cluster are required to be within the corresponding constraint range. This numerical constrained clustering problem, however, cannot be dealt with by any conventional clustering algorithms. Consequently, we devise several effective and efficient algorithms to solve such a clustering problem. It is noted that due to the intrinsic nature of the numerical constrained clustering, there is an order dependency on the process of attaining the clustering, which in many cases degrades the clustering results. In view of this, we devise a progressive constraint relaxation technique to remedy this drawback and improve the overall performance of clustering results. Explicitly, by using a smaller (tighter) constraint range in earlier iterations of merge, we will have more room to relax the constraint and seek for better solutions in subsequent iterations. It is empirically shown that the progressive constraint relaxation technique is able to improve not only the execution efficiency but also the clustering quality.

[1]  Benjamin King Step-Wise Clustering Procedures , 1967 .

[2]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[3]  Christos Faloutsos,et al.  Density biased sampling: an improved method for data mining and clustering , 2000, SIGMOD '00.

[4]  Ickjai Lee,et al.  AUTOCLUST+: Automatic Clustering of Point-Data Sets in the Presence of Obstacles , 2000, TSDM.

[5]  P. Sneath,et al.  Numerical Taxonomy , 1962, Nature.

[6]  Philip S. Yu,et al.  Data Mining: An Overview from a Database Perspective , 1996, IEEE Trans. Knowl. Data Eng..

[7]  Anthony K. H. Tung,et al.  Spatial clustering in the presence of obstacles , 2001, Proceedings 17th International Conference on Data Engineering.

[8]  Ming-Syan Chen,et al.  On the Optimal Clustering of Sequential Data , 2002, SDM.

[9]  Ming-Syan Chen,et al.  A robust and efficient clustering algorithm based on cohesion self-merging , 2002, KDD.

[10]  Boon-Lock Yeo,et al.  Time-constrained clustering for segmentation of video into story units , 1996, Proceedings of 13th International Conference on Pattern Recognition.

[11]  King-Sun Fu,et al.  A Sentence-to-Sentence Clustering Procedure for Pattern Analysis , 1978, IEEE Transactions on Systems, Man, and Cybernetics.

[12]  Yen-Jen Oyang,et al.  A Study on the Hierarchical Data Clustering Algorithm Based on Gravity Theory , 2001, PKDD.

[13]  Philip S. Yu,et al.  Fast algorithms for projected clustering , 1999, SIGMOD '99.

[14]  Anthony K. H. Tung,et al.  Constraint-based clustering in large databases , 2001, ICDT.

[15]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell..

[16]  Ayhan Demiriz,et al.  Constrained K-Means Clustering , 2000 .

[17]  Maurice D. Mulvenna,et al.  Discovering Internet marketing intelligence through online analytical web usage mining , 1998, SGMD.

[18]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[19]  Gregory Piatetsky-Shapiro,et al.  Advances in Knowledge Discovery and Data Mining , 2004, Lecture Notes in Computer Science.

[20]  Geoffrey C. Fox,et al.  Constrained Clustering as an Optimization Method , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[21]  James C. Bezdek,et al.  Pattern Recognition with Fuzzy Objective Function Algorithms , 1981, Advanced Applications in Pattern Recognition.

[22]  RICHARD C. DUBES,et al.  How many clusters are best? - An experiment , 1987, Pattern Recognit..

[23]  Dan Klein,et al.  From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering , 2002, ICML.

[24]  H. V. Jagadish,et al.  Semantic Compression and Pattern Extraction with Fascicles , 1999, VLDB.

[25]  Anders Krogh,et al.  Introduction to the theory of neural computation , 1994, The advanced book program.

[26]  Chi-Hoon Lee,et al.  On Data Clustering Analysis: Scalability, Constraints, and Validation , 2002, PAKDD.