Low-Rank Kernel Matrix Factorization for Large-Scale Evolutionary Clustering

Traditional clustering techniques are inapplicable to problems where the relationships between data points evolve over time. A clustering algorithm for such data must adapt to recent changes while also taking the historical relationships between the data points into account. In this paper, we propose ECKF, a general framework for evolutionary clustering of large-scale data based on low-rank kernel matrix factorization. To the best of our knowledge, this is the first work that clusters large evolutionary data sets by combining low-rank matrix approximation methods with matrix factorization-based clustering. Because a low-rank approximation provides a compact representation of the original matrix and, in particular, a near-optimal low-rank approximation can preserve the sparsity of the original data, ECKF gains computational efficiency and is therefore applicable to large evolutionary data sets. Moreover, matrix factorization-based methods have been shown to cluster high-dimensional data effectively in text mining and multimedia data analysis. From a theoretical standpoint, we prove the convergence and correctness of ECKF and provide a detailed analysis of its time and space complexity. Through extensive experiments on synthetic and real data sets, we show that ECKF outperforms existing evolutionary clustering methods.
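The abstract does not give ECKF's actual update rules, so the following is only an illustrative sketch of the general idea it describes: blend the current kernel matrix with the previous one, replace the result with a low-rank approximation, and cluster by a nonnegative factorization. Everything here is an assumption made for illustration rather than the authors' algorithm: the function names, the RBF kernel, the blending weight `alpha`, the truncated eigendecomposition used for the low-rank step, and the damped multiplicative update for symmetric NMF.

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """Gaussian (RBF) kernel matrix for the rows of X (assumed kernel choice)."""
    sq = np.sum(X * X, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-gamma * d2)

def low_rank(K, rank):
    """Rank-`rank` approximation of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(K)          # eigenvalues come back in ascending order
    w, V = w[-rank:], V[:, -rank:]    # keep the largest `rank` eigenpairs
    return (V * w) @ V.T

def symmetric_nmf(K, k, iters=300, seed=0):
    """Approximate K by H @ H.T with H >= 0 using damped multiplicative updates."""
    rng = np.random.default_rng(seed)
    H = rng.random((K.shape[0], k)) + 0.1
    for _ in range(iters):
        # K >= 0 and H >= 0, so the factor is at least 0.5 and H stays nonnegative.
        H *= 0.5 + 0.5 * (K @ H) / (H @ (H.T @ H) + 1e-12)
    return H

def cluster_snapshot(K_now, K_prev, k, rank, alpha=0.8):
    """Blend current and previous kernels, compress, factorize, and assign labels."""
    # Hypothetical temporal smoothing: alpha weights the current snapshot.
    K = K_now if K_prev is None else alpha * K_now + (1.0 - alpha) * K_prev
    K = np.maximum(low_rank(K, rank), 0.0)   # clip small negatives from truncation
    H = symmetric_nmf(K, k)
    return H.argmax(axis=1), H

# Toy usage: two groups of points observed at two consecutive time steps.
rng = np.random.default_rng(1)
X_prev = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
X_now = X_prev + rng.normal(0, 0.05, X_prev.shape)
labels, H = cluster_snapshot(rbf_kernel(X_now), rbf_kernel(X_prev), k=2, rank=5)
```

A larger `alpha` makes the clustering follow the current snapshot more closely, while a smaller one favors agreement with the past. A dense eigendecomposition is used only to keep the sketch short; it would not scale to the large data sets the abstract targets, where a sampling-based or sparse low-rank method would be the natural substitute.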
