Careful Seeding Method based on Independent Components Analysis for k-means Clustering

The k-means clustering method is widely used for clustering Web data because of its simplicity and speed. However, the clustering result depends heavily on the initial cluster centers, which are conventionally chosen uniformly at random from the data points. We propose a seeding method for k-means clustering that is based on independent component analysis (ICA). We evaluate the proposed method and compare it with other seeding methods on benchmark datasets. We also apply it to a Web corpus provided by ODP and to the CLUTO datasets. In these experiments, the proposed method achieved higher normalized mutual information (NMI) than the standard k-means clustering method, the KKZ method, and the k-means++ method.
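For readers unfamiliar with the evaluation measure, the following is a minimal sketch of how normalized mutual information (NMI) between a reference labeling and a clustering result can be computed. The abstract does not say which normalization the authors use; the geometric-mean form sqrt(H(U) H(V)) is an assumption here, and the function names `entropy`, `mutual_information`, and `nmi` are illustrative rather than taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_ij / n) * math.log(n_ij * n / (count_a[i] * count_b[j]))
        for (i, j), n_ij in joint.items()
    )


def nmi(true_labels, pred_labels):
    """NMI with geometric-mean normalization (an assumed choice)."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("labelings must have the same length")
    h_true, h_pred = entropy(true_labels), entropy(pred_labels)
    # Degenerate case: a labeling with a single cluster has zero entropy.
    if h_true == 0 or h_pred == 0:
        return 1.0 if h_true == h_pred else 0.0
    return mutual_information(true_labels, pred_labels) / math.sqrt(h_true * h_pred)


if __name__ == "__main__":
    reference = [0, 0, 1, 1]
    print(nmi(reference, [1, 1, 0, 0]))  # same partition, cluster ids swapped
    print(nmi(reference, [0, 1, 0, 1]))  # partition independent of the reference
```

NMI is invariant to how cluster ids are numbered, so a seeding method is rewarded only for recovering the reference partition, not for matching label names.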
