An Automatic Clustering Algorithm and Its Properties in High-Dimensional Spaces

An economical technique for approximating a joint N-dimensional probability density function has been described by Sebestyen and Edie [20]. The algorithm searches for clusters of points and treats each cluster as one hyperellipsoidal cell in an N-dimensional histogram. Among the advantages of this scheme are: 1) the histogram cell descriptors (location, shape, and size) can be determined adaptively from sequentially introduced data samples of known classification, and 2) the number of cells required for a good fit can usually be kept small. No assumptions are required about the underlying statistical structure of the data. The algorithm requires three types of "control parameters," which critically affect its performance and depend on the number of dimensions. These parameters control the birth, shape, and growth rate of the cells. Guides for choosing the control parameter values were presented in [20]. These guides functioned well for spaces of three dimensions or fewer, but did not yield usable values for spaces of higher dimensionality. This paper presents heuristics that were developed to automate the selection of the control parameters. The properties of these parameters were studied as a function of dimension. Two of the control parameters were found to be linearly related to dimension. This relationship provides a method for determining their values by extrapolation, thereby avoiding a great deal of computation.
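
The extrapolation idea can be illustrated with a minimal sketch. It assumes that a control parameter has already been tuned at a few low dimensions, fits a straight line to those values by ordinary least squares, and evaluates the line at a higher dimension. The function names (fit_line, extrapolate) and the numbers below are hypothetical placeholders chosen for easy checking; they are not values or procedures taken from the paper or from [20].

```python
def fit_line(dims, values):
    """Ordinary least-squares fit of values = slope * dim + intercept."""
    n = len(dims)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (dimension, value) pairs")
    mean_d = sum(dims) / n
    mean_v = sum(values) / n
    # Sum of squared deviations of the dimensions about their mean.
    sxx = sum((d - mean_d) ** 2 for d in dims)
    if sxx == 0:
        raise ValueError("dimensions must not all be equal")
    # Sum of cross-deviations between dimensions and parameter values.
    sxy = sum((d - mean_d) * (v - mean_v) for d, v in zip(dims, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_d
    return slope, intercept


def extrapolate(dims, values, target_dim):
    """Predict the parameter value at target_dim from the fitted line."""
    slope, intercept = fit_line(dims, values)
    return slope * target_dim + intercept


# Placeholder values only (NOT results from the paper): an exactly linear
# sequence standing in for a parameter tuned at dimensions 2 through 5.
tuned_dims = [2, 3, 4, 5]
tuned_values = [1.0, 1.5, 2.0, 2.5]

# Estimate the parameter at dimension 10 without tuning it there.
predicted = extrapolate(tuned_dims, tuned_values, 10)
print(predicted)
```

A sketch like this applies only to the two parameters reported as linear in dimension; the third would still have to be chosen by other means.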