Incremental Spectral Clustering With Application to Monitoring of Evolving Blog Communities

In recent years, spectral clustering method has gained attentions because of its superior performance compared to other traditional clustering algorithms such as K-means algorithm. The existing spectral clustering algorithms are all off-line algorithms, i.e., they can not incrementally update the clustering result given a small change of the data set. However, the capability of incrementally updating is essential to some applications such as real time monitoring of the evolving communities of websphere or blogsphere. Unlike traditional stream data, these applications require incremental algorithms to handle not only insertion/deletion of data points but also similarity changes between existing items. This paper extends the standard spectral clustering to such evolving data by introducing the incidence vector/matrix to represent two kinds of dynamics in the same framework and by incrementally updating the eigenvalue system. Our incremental algorithm, initialized by a standard spectral clustering, continuously and efficiently updates the eigenvalue system and generates instant cluster labels, as the data set is evolving. The algorithm is applied to a blog data set. Compared with recomputation of the solution by standard spectral clustering, it achieves similar accuracy but with much lower computational cost. Close inspection into the blog content shows that the incremental approach can discover not only the stable blog communities but also the evolution of the individual multi-topic blogs.

[1]  Robert L. Grossman,et al.  GenIc: A Single-Pass Generalized Incremental Algorithm for Clustering , 2004, SDM.

[2]  Xiaojin Zhu,et al.  --1 CONTENTS , 2006 .

[3]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[4]  Chung-Kuan Cheng,et al.  Towards efficient hierarchical designs by ratio cut partitioning , 1989, 1989 IEEE International Conference on Computer-Aided Design. Digest of Technical Papers.

[5]  Rajeev Motwani,et al.  Incremental Clustering and Dynamic Information Retrieval , 2004, SIAM J. Comput..

[6]  Andrew B. Kahng,et al.  New spectral methods for ratio cut partitioning and clustering , 1991, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[7]  Yun Chi,et al.  Discovery of Blog Communities based on Mutual Awareness , 2006 .

[8]  Jianbo Shi,et al.  Segmentation given partial grouping constraints , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[10]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[11]  Pietro Perona,et al.  Self-Tuning Spectral Clustering , 2004, NIPS.

[12]  Chris H. Q. Ding,et al.  A min-max cut algorithm for graph partitioning and data clustering , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[13]  Xin Liu,et al.  Document clustering based on non-negative matrix factorization , 2003, SIGIR.

[14]  Jaideep Srivastava,et al.  Incremental page rank computation on evolving graphs , 2005, WWW '05.

[15]  Amy Nicole Langville,et al.  Updating pagerank with iterative aggregation , 2004, WWW Alt. '04.

[16]  D. Hindman The Virtual Community: Homesteading on the Electronic Frontier , 1996 .

[17]  Sudipto Guha,et al.  Clustering Data Streams , 2000, FOCS.