Bottom-Up Variable Selection in Cluster Analysis Using Bootstrapping: A Proposal

Variable selection is a problem of increasing interest in many areas of multivariate statistics such as classification, clustering and regression. In contradiction to supervised classification, variable selection in cluster analysis is a much more difficult problem because usually nothing is known about the true class structure. In addition, in clustering, variable selection is highly related to the main problem of the determination of the number of clusters K to be inherent in the data. Here we present a very general bottom-up approach to variable selection in clustering starting with univariate investigations of stability. The hope is that the structure of interest may be contained in only a small subset of variables. Very general means, we make only use of non-parametric resampling techniques for purposes of validation, where we are looking for clusters that can be reproduced to a high degree under resampling schemes. So, our proposed technique can be applied to almost any cluster analysis method.