K-means is widely used for clustering Web data because it is simple and fast. However, its result depends heavily on the initial cluster centers, which standard k-means chooses uniformly at random from the data points. We propose a seeding method for k-means based on independent component analysis (ICA). We evaluate the proposed method on benchmark datasets and compare it with other seeding methods. We also apply it to a Web corpus provided by the Open Directory Project (ODP). The experiments show that the proposed method achieves higher normalized mutual information than both standard k-means and k-means++. These results suggest that the proposed method is useful for clustering Web corpora.
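To make the comparison concrete, the following is a minimal sketch of the two baseline seeding strategies named above (uniform random seeding and k-means++) together with a normalized mutual information score. It does not implement the proposed ICA-based seeding, whose details are not given here. The toy data, the function names, and the choice to normalize mutual information by the geometric mean of the two entropies are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def seed_uniform(points, k, rng):
    # Baseline: k distinct data points chosen uniformly at random.
    return rng.sample(points, k)

def seed_kmeanspp(points, k, rng):
    # k-means++: after a uniformly chosen first center, each further center
    # is drawn with probability proportional to its squared distance to the
    # nearest center chosen so far. Assumes the points are not all identical.
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def kmeans(points, centers, iters=100):
    """Lloyd iterations from the given initial centers; returns labels."""
    centers = list(centers)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(len(centers)),
                          key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return labels

def nmi(truth, pred):
    """Mutual information divided by sqrt(H(truth) * H(pred)).
    The geometric-mean normalization is an assumption of this sketch."""
    n = len(truth)
    ct, cp = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    ht = -sum(c / n * math.log(c / n) for c in ct.values())
    hp = -sum(c / n * math.log(c / n) for c in cp.values())
    return mi / math.sqrt(ht * hp) if ht > 0 and hp > 0 else 0.0

# Hypothetical toy data: three Gaussian blobs in the plane.
rng = random.Random(0)
points, truth = [], []
for label, (cx, cy) in enumerate([(0, 0), (10, 0), (0, 10)]):
    for _ in range(30):
        points.append((rng.gauss(cx, 1), rng.gauss(cy, 1)))
        truth.append(label)

for name, seeder in [("uniform", seed_uniform), ("k-means++", seed_kmeanspp)]:
    labels = kmeans(points, seeder(points, 3, rng))
    print(name, round(nmi(truth, labels), 3))
```

A single run on toy data says little about either strategy. A fairer comparison would average the score over many random seeds and use labeled benchmark data, as the evaluation described above does.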