The performance of K-means clustering depends on the initial partition. We motivate, theoretically and experimentally, the use of a deterministic divisive hierarchical method for initialization, which we refer to as PCA-Part (principal components analysis partitioning). K-means minimizes the SSE (sum-squared-error) criterion, and the first principal direction (the eigenvector corresponding to the largest eigenvalue of the covariance matrix) is the direction that contributes the most to the SSE. A good candidate direction onto which to project a cluster for splitting is, then, the first principal direction. This is the basis of the PCA-Part initialization method. Our experiments reveal that PCA-Part generally leads K-means to clusters whose SSE values are close to the minimum SSE values obtained over one hundred random-start runs. In addition, this deterministic initialization method often leads K-means to faster convergence (fewer iterations) than random methods. Finally, we show theoretically, and confirm experimentally on synthetic data, when PCA-Part may fail.
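The divisive scheme described above can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: it assumes the cluster with the largest SSE is split next, and that the split point is the projected centroid of the cluster along its first principal direction. All function names are illustrative.

```python
import numpy as np

def sse(X):
    # Sum of squared distances from the points to their centroid.
    return float(((X - X.mean(axis=0)) ** 2).sum())

def pca_part_init(X, k):
    """Sketch of PCA-Part-style initialization: divisively split the
    cluster contributing the largest SSE along its first principal
    direction until k clusters remain; return their centroids as the
    initial K-means centers."""
    clusters = [X]
    while len(clusters) < k:
        # Pick the cluster that currently contributes the most SSE.
        i = int(np.argmax([sse(C) for C in clusters]))
        C = clusters.pop(i)
        centered = C - C.mean(axis=0)
        # First principal direction = top right singular vector of the
        # centered data (eigenvector of the covariance matrix with the
        # largest eigenvalue).
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        proj = centered @ vt[0]
        # Split at the projected centroid (which is 0 after centering).
        clusters.append(C[proj <= 0])
        clusters.append(C[proj > 0])
    return np.array([C.mean(axis=0) for C in clusters])
```

The resulting centers can be passed directly to a K-means routine (e.g. as the `init` argument of scikit-learn's `KMeans`); because the procedure is deterministic, repeated runs produce the same initialization.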