Paper Information - Simultaneous Heterogeneous Data Clustering Based on Higher Order Relationships


Co-clustering of heterogeneous data has attracted increasing attention in web mining and information retrieval. Clustering approaches for two-type heterogeneous data (bi-type co-clustering) have been well studied in the literature. However, work on data with more than two types (high-order co-clustering or multi-type co-clustering) is still limited. In this paper, we present a multi-type co-clustering algorithm that clusters the data of different types simultaneously. We use a higher-order tensor to model the high-order relationships; each element of the tensor describes the relation (similarity) among a given set composed of data objects from every type. Based on these high-order relationships, we embed the multi-type data objects into low-dimensional spaces using an algorithm based on Clique Expansion, which can be viewed as a high-order extension of the normalized cut approach. Finally, the k-means method is used to cluster the low-dimensional data. Experimental results show the effectiveness of the proposed method on both a toy problem and real data.
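The pipeline outlined in the abstract (relation tensor, clique expansion, normalized-cut style embedding, k-means) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes exactly three object types, non-negative tensor entries, a clique expansion in which each hyperedge weight is added to every pair of its member objects, and a simple farthest-point k-means initialization. All function names (`clique_expansion_affinity`, `spectral_embedding`, `kmeans`, `co_cluster`) and the toy tensor are hypothetical.

```python
import numpy as np

def clique_expansion_affinity(T):
    """Pairwise affinity over all objects from a 3-type relation tensor.

    T[i, j, k] >= 0 is the weight of the hyperedge joining object i of type 1,
    object j of type 2 and object k of type 3. Clique expansion (as assumed
    here) adds that weight to each of the three pairs in the hyperedge.
    """
    n1, n2, n3 = T.shape
    n = n1 + n2 + n3
    W = np.zeros((n, n))
    W[:n1, n1:n1 + n2] = T.sum(axis=2)       # type 1 vs type 2
    W[:n1, n1 + n2:] = T.sum(axis=1)         # type 1 vs type 3
    W[n1:n1 + n2, n1 + n2:] = T.sum(axis=0)  # type 2 vs type 3
    return W + W.T                           # symmetrize

def spectral_embedding(W, dim):
    """Top eigenvectors of D^{-1/2} W D^{-1/2}, with rows scaled to unit length."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(S)              # eigenvalues in ascending order
    X = vecs[:, -dim:]                       # keep the `dim` largest
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def kmeans(X, k, n_iter=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def co_cluster(T, k):
    """Cluster the objects of all three types at once; returns one label array per type."""
    W = clique_expansion_affinity(T)
    X = spectral_embedding(W, k)
    labels = kmeans(X, k)
    n1, n2, _ = T.shape
    return labels[:n1], labels[n1:n1 + n2], labels[n1 + n2:]

# Toy tensor: 4 objects per type, two planted groups {0, 1} and {2, 3}.
# Triples drawn from a single group get weight 1.0, all others 0.05.
member = np.array([0, 0, 1, 1])
same = (member[:, None, None] == member[None, :, None]) & \
       (member[None, :, None] == member[None, None, :])
T = np.where(same, 1.0, 0.05)

l1, l2, l3 = co_cluster(T, 2)
print(l1, l2, l3)
```

In this toy setting the three label arrays should recover the planted groups in every type. The sketch builds the full tensor densely, so it only suits small problems; real data would likely need a sparse representation and a different weighting of the expanded edges.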

Changshui Zhang | Fei Wang | Shouchun Chen
