Paper information - CRD: fast co-clustering on large datasets utilizing sampling-based matrix decomposition


The problem of simultaneously clustering the rows and columns of a data matrix (co-clustering) arises in important applications such as text data mining, microarray analysis, and recommendation system analysis. Compared with classical clustering algorithms, co-clustering algorithms have been shown to be more effective at discovering hidden clustering structures in the data matrix. The complexity of previous co-clustering algorithms is usually O(m × n), where m and n are the numbers of rows and columns in the data matrix, respectively. This limits their applicability to data matrices with a large number of rows and columns. Moreover, some huge datasets cannot be held entirely in main memory during co-clustering, which violates an assumption made by previous algorithms. In this paper, we propose CRD, a general framework for fast co-clustering of large datasets. By utilizing recently developed sampling-based matrix decomposition methods, CRD achieves an execution time linear in m and n. In addition, CRD does not require the whole data matrix to be in main memory. We conducted extensive experiments on both real and synthetic data. Compared with previous co-clustering algorithms, CRD achieves competitive accuracy at a much lower computational cost.
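To give a rough idea of what a sampling-based matrix decomposition looks like, the sketch below builds a CUR-style approximation from a handful of sampled columns and rows. It is an illustration under stated assumptions, not the CRD algorithm itself: the abstract does not specify the sampling scheme, so norm-squared sampling is assumed here, and the function name `sampled_cur` and its parameters are hypothetical. For simplicity the middle factor is computed from the full matrix, which a genuinely linear-time, out-of-core method would avoid.

```python
import numpy as np


def sampled_cur(A, n_cols, n_rows, rng):
    """Approximate A as C @ U @ R using randomly sampled columns and rows.

    Assumption: columns and rows are drawn with replacement, with probability
    proportional to their squared Euclidean norm. This is one common choice,
    not necessarily the one used by CRD.
    """
    total = np.sum(A * A)  # squared Frobenius norm of A

    # Sample and rescale columns.
    col_p = np.sum(A * A, axis=0) / total
    col_idx = rng.choice(A.shape[1], size=n_cols, replace=True, p=col_p)
    C = A[:, col_idx] / np.sqrt(n_cols * col_p[col_idx])

    # Sample and rescale rows.
    row_p = np.sum(A * A, axis=1) / total
    row_idx = rng.choice(A.shape[0], size=n_rows, replace=True, p=row_p)
    R = A[row_idx, :] / np.sqrt(n_rows * row_p[row_idx])[:, None]

    # Middle factor that best links C and R in the least-squares sense.
    # Note: this step reads all of A; it is used here only for clarity.
    U = np.linalg.pinv(C, rcond=1e-10) @ A @ np.linalg.pinv(R, rcond=1e-10)
    return C, U, R
```

When the data matrix has low rank, a small number of sampled columns and rows typically spans its column and row spaces, so `C @ U @ R` reproduces the matrix closely while `C` and `R` are much smaller than the original. A co-clustering step could then operate on these small factors instead of the full matrix.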
