Simultaneous Heterogeneous Data Clustering Based on Higher Order Relationships

Co-clustering of heterogeneous data has attracted increasing attention in web mining and information retrieval. Clustering approaches for data of two heterogeneous types (bi-type co-clustering) have been well studied in the literature. However, work on data with more than two types (high-order co-clustering or multi-type co-clustering) is still limited. In this paper, we present a multi-type co-clustering algorithm that clusters data of different types simultaneously. We use a higher-order tensor to model the high-order relationships; each element of the tensor describes the relation (similarity) among a given set of data objects drawn from every type. Based on these high-order relationships, we embed the multi-type data objects into low-dimensional spaces using an algorithm based on Clique Expansion, which can be viewed as a high-order extension of the normalized cut approach. Finally, the k-means method is used to cluster the low-dimensional data. Experimental results show the effectiveness of the proposed method on both a toy problem and real data.
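
The pipeline described in the abstract (relation tensor, clique-expansion embedding, k-means) can be illustrated with a minimal sketch. This is not the paper's exact formulation: it assumes a 3-way tensor whose entry `T[i, j, k]` weights a hyperedge joining one object of each type, uses a plain clique expansion (each hyperedge weight is added to every pair of its members), a standard normalized-cut style spectral embedding, and a small deterministic k-means. All function names (`clique_expansion_affinity`, `ncut_embedding`, `kmeans`, `co_cluster`) are hypothetical and introduced only for illustration.

```python
import numpy as np

def clique_expansion_affinity(T):
    """Pairwise affinity over all objects of a 3-way relation tensor T.

    Assumption: T[i, j, k] is the weight of a hyperedge {x_i, y_j, z_k}.
    Clique expansion adds that weight to each pair inside the hyperedge,
    so the (x_i, y_j) affinity is the sum of T[i, j, :], and so on.
    """
    n1, n2, n3 = T.shape
    n = n1 + n2 + n3
    W = np.zeros((n, n))
    W[:n1, n1:n1 + n2] = T.sum(axis=2)        # type-1 vs type-2 pairs
    W[:n1, n1 + n2:] = T.sum(axis=1)          # type-1 vs type-3 pairs
    W[n1:n1 + n2, n1 + n2:] = T.sum(axis=0)   # type-2 vs type-3 pairs
    return W + W.T                            # symmetric, zero diagonal

def ncut_embedding(W, dim):
    """Embed nodes using the top eigenvectors of D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))  # guard zero degree
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(S)               # eigenvalues in ascending order
    return d_inv_sqrt[:, None] * vecs[:, -dim:]

def kmeans(X, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    C = np.array(centers)
    for _ in range(iters):
        sq = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq, axis=1)
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else C[j] for j in range(k)])
        if np.allclose(new_C, C):
            break
        C = new_C
    return labels

def co_cluster(T, k):
    """Cluster the objects of all three types at once; returns one label each."""
    W = clique_expansion_affinity(T)
    return kmeans(ncut_embedding(W, k), k)

# Toy tensor: two planted co-clusters plus weak background similarity.
T = np.full((4, 4, 4), 0.01)
T[:2, :2, :2] += 1.0
T[2:, 2:, 2:] += 1.0
labels = co_cluster(T, 2)
print(labels)  # labels for the 4 + 4 + 4 objects, in type order
```

The returned label vector lists the type-1 objects first, then type-2, then type-3, so objects of different types that share a label belong to the same co-cluster. A plain clique expansion discards information that a genuinely higher-order model would keep, which is one reason to treat this only as a rough illustration of the idea.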
