Clustering trees with instance level constraints (Extended abstract)

Clustering methods partition a given set of instances into subsets (clusters) such that the instances in a given cluster are similar. Traditional clustering algorithms, such as k-means and hierarchical agglomerative clustering, are unsupervised, that is, they only have access to the attributes describing each instance; no direct information about the actual assignment of instances to clusters is available. This distinguishes clustering from supervised classification, where the class of each instance is given. Over the past five years, constrained clustering methods have become popular, motivated by applications such as gene clustering, document clustering, web search result clustering, and lane finding from GPS traces. Constrained clustering investigates how domain knowledge can improve clustering performance. Domain knowledge is given as a set of constraints that must hold on the clusters. We consider two common types of instance level (IL) constraints (Fig. 1.a): must-link and cannot-link constraints [3]. A must-link constraint ML(a,b) specifies that instances a and b must belong to the same cluster, and a cannot-link constraint CL(a,b) specifies that a and b must not be placed in the same cluster. IL constraints provide additional information about the assignment of instances to clusters. Clustering with IL constraints is therefore considered to be a form of semi-supervised learning. IL constraints have been successfully incorporated into popular clustering algorithms, such as k-means [3]. This paper investigates how clustering trees can support IL constraints. Clustering trees are decision trees that are used for clus