Efficient Semi-supervised Spectral Co-clustering with Constraints

Co-clustering simultaneously clusters objects and features in order to uncover inter-correlated patterns. For example, by analyzing blog click-through data, one can find groups of users who are interested in specific groups of blogs and use them for applications such as recommendation. However, it is usually very difficult to achieve good co-clustering quality by analyzing the object-feature correlation data alone, because the data are sparse and noisy. Meanwhile, one may have prior knowledge that indicates the internal structure of the co-clusters. For instance, user cluster information may be available from a social network system, and blog-blog similarity may be derived from social tags or content. This prior information provides some supervision toward the co-cluster structure and may help reduce the effect of sparsity and noise. Most co-clustering algorithms, however, do not use this information and may produce meaningless results. In this paper we study the problem of finding the optimal co-clusters when some objects and features are believed a priori to be in the same cluster. We propose a matrix-decomposition-based approach that formulates the task as a trace minimization problem and solves it efficiently using selected eigenvectors. The asymptotic complexity of the proposed approach is the same as that of co-clustering without constraints. Experiments cover graph-pattern co-clustering and document-word co-clustering. For instance, on the graph-pattern data set, the proposed model improves the normalized mutual information by as much as 5.5 times and runs 10 times faster than two naive solutions that expand the edges and vertices in the graphs.
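To make the phrase "solves it efficiently using selected eigenvectors" more concrete, the following is a minimal sketch of the unconstrained baseline only, namely two-way bipartite spectral co-clustering in the style of Dhillon's 2001 formulation. It does not implement the constrained method of this paper, whose exact objective is not given in the abstract. The function name `spectral_cocluster_2way` and the toy matrix are hypothetical, and the sketch assumes a nonnegative matrix with no all-zero row or column.

```python
import numpy as np

def spectral_cocluster_2way(A):
    """Split the rows and columns of a nonnegative matrix into two co-clusters.

    Hypothetical helper for illustration: unconstrained baseline only.
    Assumes every row and every column has a positive sum.
    """
    A = np.asarray(A, dtype=float)
    # Inverse square roots of the row and column degrees.
    r = 1.0 / np.sqrt(A.sum(axis=1))
    c = 1.0 / np.sqrt(A.sum(axis=0))
    # Degree-normalized matrix D1^{-1/2} A D2^{-1/2}.
    A_n = r[:, None] * A * c[None, :]
    U, s, Vt = np.linalg.svd(A_n, full_matrices=False)
    # The first singular pair is trivial (it only encodes the degrees);
    # the second pair carries the relaxed two-way partition.
    z_rows = r * U[:, 1]
    z_cols = c * Vt[1, :]
    # Threshold the relaxed solution at zero to obtain cluster labels.
    return (z_rows > 0).astype(int), (z_cols > 0).astype(int)

# Toy object-feature matrix with two nearly separate blocks (made-up data).
A = np.array([
    [3, 3, 0, 0, 0],
    [3, 3, 1, 0, 0],
    [0, 0, 2, 3, 3],
    [0, 0, 3, 3, 2],
])
rows, cols = spectral_cocluster_2way(A)
print(rows, cols)
```

Which block receives label 0 or 1 depends on the arbitrary sign of the singular vectors, so only the grouping is meaningful. For more than two co-clusters, the usual extension keeps several singular vectors and runs k-means on the stacked row and column embeddings.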
