Feature Selection in k-Median Clustering

An effective method for selecting features when clustering unlabeled data is proposed, based on modifying the objective function of the standard k-median clustering algorithm. The modification perturbs the objective function by a term that drives the median of each of the k clusters toward the (shifted) global median of zero for the entire dataset. As the perturbation parameter increases, more and more features are automatically driven toward the global zero median and eliminated from the problem, until a single feature remains. An error curve for unlabeled-data clustering, as a function of the number of features used, gives the reduced-feature clustering error relative to the "gold standard" of full-feature clustering. This clustering error curve parallels a classification error curve based on the true data labels, which justifies using the former curve, computed from unlabeled data alone, to choose an appropriate number of reduced features that achieves correctness comparable to that obtained with the full set of original features. For example, on the 3-class Wine dataset, clustering with 4 selected input-space features is comparable, to within 4%, to clustering with the original 13 features of the problem.
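To make the mechanism concrete, the following is a minimal sketch (not the authors' implementation) of one way the perturbed k-median iteration could be realized. The data are shifted so the overall coordinate-wise median is zero, each cluster-median coordinate is updated under an L1 pull toward zero with weight `lam` (solved here as a weighted-median problem with a pseudo-point at 0 of weight `lam`), and features whose medians vanish in every cluster are flagged as eliminated. The function names, the parameter `lam`, and the stopping rule are illustrative assumptions, not part of the paper.

```python
import numpy as np

def weighted_median(values, weights):
    """Return a minimizer c of sum_i weights[i] * |values[i] - c|."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    # lower weighted median: first point where cumulative weight reaches half the total
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def perturbed_kmedian(X, k, lam, n_iter=50, seed=0):
    """Sketch of k-median clustering with medians perturbed toward the global zero median."""
    rng = np.random.default_rng(seed)
    X = X - np.median(X, axis=0)          # shift so the global median is zero
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # assign each point to the nearest cluster median in the 1-norm
        dist = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        for l in range(k):
            pts = X[labels == l]
            if len(pts) == 0:
                continue
            for j in range(d):
                # update coordinate j: minimize sum_i |x_ij - c| + lam * |c|,
                # i.e. a weighted median of the cluster's j-th coordinates
                # plus a pseudo-point at 0 with weight lam
                vals = np.append(pts[:, j], 0.0)
                wts = np.append(np.ones(len(pts)), lam)
                centers[l, j] = weighted_median(vals, wts)
    # features whose medians are zero in every cluster are eliminated
    kept = np.where(np.abs(centers).max(axis=0) > 1e-12)[0]
    return labels, centers, kept
```

Sweeping `lam` upward in this sketch progressively drives median coordinates to zero, tracing out the reduced-feature clustering error curve described above.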