Clustering With Constraints Using Graph Based Approach

Clustering can be considered the most important unsupervised learning problem; it deals with finding structure in a collection of unlabeled data. To this end, it organizes objects into groups whose members are similar to one another and dissimilar to the members of other groups [1]. While this process is usually entirely unsupervised, additional background information (namely constraints) is available in some domains and must be taken into account in the clustering solution. Such constraints vary with the user and the domain, but we are usually interested in background information in the form of instance-level must-link and cannot-link constraints. A must-link constraint enforces that two instances must be placed in the same cluster, while a cannot-link constraint enforces that two instances must not be placed in the same cluster. Enforcing these constraints requires modifying the clustering algorithm, which is not always feasible. Many authors have investigated the use of constraints in clustering. In [2], the authors proposed a modified version of the COBWEB clustering algorithm that uses background information about pairs of instances to constrain their cluster placement. Similarly, a more recent work [3] extended the ubiquitous k-Means algorithm to incorporate the same types of instance-level hard constraints (must-link and cannot-link). Recently, we proposed a new clustering approach [4] based on the concept of b-coloring of a graph [5]. It exhibits important clustering features and builds a fine partition of the data set (numeric or symbolic) into clusters when the number of clusters is not specified beforehand. A b-coloring of a graph is an assignment of colors (clusters) to the vertices of the graph such that (i) no two adjacent vertices have the same color (proper coloring), and (ii) for each color there exists at least one dominating vertex, that is, a vertex adjacent to at least one vertex of every other color.
This dominating vertex reflects the properties of its class and guarantees that the class is clearly separated from all other classes of the partition. In this paper, we investigate ways to integrate background information into the b-coloring based clustering algorithm. The proposed algorithm, which we refer to as COP-b-coloring (for constraint partitioning b-coloring), is evaluated on benchmark data sets, and the results indicate that instance-level hard constraints offer real benefits (in accuracy and runtime) for the clustering problem.
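
To make the two b-coloring conditions and the two constraint types concrete, the following Python sketch checks them on a toy graph. It is only an illustration of the definitions, not the COP-b-coloring algorithm itself. The function names (is_b_coloring, violated_constraints), the adjacency-dictionary representation, and the four-vertex example graph are all assumptions introduced here.

```python
def is_b_coloring(adj, color):
    """Check conditions (i) and (ii) of a b-coloring.

    adj   -- dict mapping each vertex to the set of its neighbours (hypothetical format)
    color -- dict mapping each vertex to its colour (cluster label)
    """
    colors = set(color.values())

    # (i) Proper colouring: no two adjacent vertices share a colour.
    for v, neighbours in adj.items():
        if any(color[u] == color[v] for u in neighbours):
            return False

    # (ii) Every colour has a dominating vertex, i.e. a vertex whose
    # neighbours together cover all the other colours.
    for c in colors:
        others = colors - {c}
        has_dominating = any(
            color[v] == c and others <= {color[u] for u in adj[v]}
            for v in adj
        )
        if not has_dominating:
            return False
    return True


def violated_constraints(color, must_link, cannot_link):
    """Return the instance-level constraints that a colouring breaks."""
    bad = [("must-link", a, b) for a, b in must_link if color[a] != color[b]]
    bad += [("cannot-link", a, b) for a, b in cannot_link if color[a] == color[b]]
    return bad


# Toy example (assumed data): a path a - b - c - d.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}

two_colors = {"a": 0, "b": 1, "c": 0, "d": 1}
three_colors = {"a": 0, "b": 1, "c": 2, "d": 0}

print(is_b_coloring(adj, two_colors))    # True
print(is_b_coloring(adj, three_colors))  # False: colour 0 has no dominating vertex
print(violated_constraints(two_colors, [("a", "c")], [("a", "b")]))  # []
print(violated_constraints(two_colors, [("a", "b")], []))  # [('must-link', 'a', 'b')]
```

In the three-color assignment the coloring is proper, but neither vertex of color 0 sees both other colors, so condition (ii) fails. This shows why a proper coloring alone is not enough to give every cluster a representative vertex.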