Semi-Supervised Clustering Using Genetic Algorithms

A semi-supervised clustering algorithm is proposed that combines the benefits of supervised and unsupervised learning methods. The approach allows unlabeled data with no known class to be used to improve classification accuracy. The objective function of an unsupervised technique, e.g. K-means clustering, is modified to minimize both the cluster dispersion of the input attributes and a measure of cluster impurity based on the class labels. Minimizing the cluster dispersion of the examples is a form of capacity control to prevent overfitting. For the the output labels, impurity measures from decision tree algorithms such as the Gini index can be used. A genetic algorithm optimizes the objective function to produce clusters. Experimental results show that using class information improves the generalization ability compared to unsupervised methods based only on the input attributes.