The performance of K-means clustering depends on the initial partition. We motivate, theoretically and experimentally, the use of a deterministic divisive hierarchical method for initialization, which we refer to as PCA-Part (principal components analysis partitioning). K-means minimizes the SSE (sum-squared-error) criterion, and the first principal direction (the eigenvector corresponding to the largest eigenvalue of the covariance matrix) is the direction that contributes the most to the SSE. A good candidate direction onto which to project a cluster for splitting is, then, the first principal direction. This is the basis of the PCA-Part initialization method. Our experiments reveal that PCA-Part generally leads K-means to clusters whose SSE values are close to the minimum SSE values obtained over one hundred random-start runs. In addition, this deterministic initialization method often leads K-means to faster convergence (fewer iterations) than random methods. Finally, we show theoretically, and confirm experimentally on synthetic data, when PCA-Part may fail.
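The divisive scheme described above can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: it assumes the cluster with the largest SSE is split next, and that the split point is the projected centroid of the cluster along its first principal direction. All function names are illustrative.

```python
import numpy as np

def sse(X):
    # Sum of squared distances from the points to their centroid.
    return float(((X - X.mean(axis=0)) ** 2).sum())

def pca_part_init(X, k):
    """Sketch of PCA-Part-style initialization: divisively split the
    cluster contributing the largest SSE along its first principal
    direction until k clusters remain; return their centroids as the
    initial K-means centers."""
    clusters = [X]
    while len(clusters) < k:
        # Pick the cluster that currently contributes the most SSE.
        i = int(np.argmax([sse(C) for C in clusters]))
        C = clusters.pop(i)
        centered = C - C.mean(axis=0)
        # First principal direction = top right singular vector of the
        # centered data (eigenvector of the covariance matrix with the
        # largest eigenvalue).
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        proj = centered @ vt[0]
        # Split at the projected centroid (which is 0 after centering).
        clusters.append(C[proj <= 0])
        clusters.append(C[proj > 0])
    return np.array([C.mean(axis=0) for C in clusters])
```

The resulting centers can be passed directly to a K-means routine (e.g. as the `init` argument of scikit-learn's `KMeans`); because the procedure is deterministic, repeated runs produce the same initialization.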