Despite its popularity for general clustering, K-means suffers three major shortcomings: it scales poorly computationally, the number of clusters K has to be supplied by the user, and the search is prone to local minima. We propose solutions for the first two problems, and a partial remedy for the third. Building on prior work for algorithmic acceleration that is not based on approximation, we introduce a new algorithm that efficiently searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure. The innovations include two new ways of exploiting cached sufficient statistics and a new very efficient test that in one K-means sweep selects the most promising subset of classes for refinement. This gives rise to a fast, statistically founded algorithm that outputs both the number of classes and their parameters. Experiments show this technique reveals the true number of classes in the underlying distribution, and that it is much faster than repeatedly using accelerated K-means for different values of K.
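The core idea of scoring candidate cluster counts can be illustrated with a minimal sketch: run K-means for each candidate K and keep the K that maximizes a BIC score (log-likelihood of a hard-assignment spherical-Gaussian mixture, penalized by parameter count). This is only an assumed simplification for illustration, not the paper's algorithm: it omits the cached sufficient statistics, the accelerated K-means, and the per-cluster refinement test described above, and the `kmeans`/`bic` helpers are hypothetical names.

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Lloyd's algorithm with deterministic farthest-first initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        # pick the point farthest from all chosen centers as the next center
        d2 = np.min(((X[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

def bic(X, centers, labels):
    """BIC of a hard-assignment spherical-Gaussian mixture with shared variance."""
    n, d = X.shape
    k = len(centers)
    rss = ((X - centers[labels]) ** 2).sum()
    sigma2 = max(rss / (n * d), 1e-12)          # pooled ML variance estimate
    counts = np.bincount(labels, minlength=k)
    ll = (sum(m * np.log(m / n) for m in counts if m > 0)
          - n * d / 2 * np.log(2 * np.pi * sigma2)
          - rss / (2 * sigma2))
    p = (k - 1) + k * d + 1                     # mixture weights + centers + variance
    return ll - p / 2 * np.log(n)

# Three well-separated synthetic clusters; BIC should peak at K = 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, (100, 2)) for m in ([0, 0], [5, 5], [0, 5])])
best_k = max(range(1, 7), key=lambda k: bic(X, *kmeans(X, k)))
```

The BIC penalty is what stops the score from growing without bound: the likelihood always improves with more centers, but each extra cluster costs `(d + 1) * log(n) / 2` or more, so over-splitting a true cluster lowers the score.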