Subspace Weighting Co-Clustering of Gene Expression Data

Microarray technology enables the collection of vast amounts of gene expression data from biological experiments, and clustering algorithms have been applied successfully to explore these data. Because a set of genes may be correlated with only a subset of samples, co-clustering is useful for recovering co-clusters in gene expression data. In this paper, we propose a novel algorithm, called Subspace Weighting Co-Clustering (SWCC), for high-dimensional gene expression data. SWCC introduces a gene subspace weight matrix that identifies the contribution of each gene in distinguishing different sample clusters. We design a new co-clustering objective function that incorporates this weight matrix, and we develop an iterative algorithm to optimize it, in which the weight matrix is computed automatically during the co-clustering process. In our empirical study on ten gene expression data sets, the proposed algorithm shows encouraging results in comparison with six state-of-the-art clustering algorithms. We also propose using SWCC for gene clustering and selection; the experimental results show that the selected genes can improve the classification performance of Random Forests.
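The abstract does not state the SWCC objective function, so the following is only an illustrative sketch of the general idea of a per-cluster gene weight matrix that is recomputed inside an iterative clustering loop. It assumes an entropy-weighted k-means style update (weights proportional to exp(-dispersion / gamma)) applied to sample clusters only; it is not the authors' algorithm and performs no gene-side clustering. The function name `subspace_weighted_kmeans` and the parameters `gamma`, `n_iter`, and `seed` are hypothetical.

```python
import numpy as np


def subspace_weighted_kmeans(X, k, gamma=1.0, n_iter=20, seed=0):
    """Cluster the rows of X (samples x genes) with per-cluster gene weights.

    Assumption: entropy-weighting style update, NOT the SWCC objective.
    Returns (labels, centers, weights), where weights has shape (k, genes)
    and each row sums to 1.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Start from k distinct samples as centers and uniform gene weights.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    weights = np.full((k, m), 1.0 / m)
    labels = np.zeros(n, dtype=int)

    for _ in range(n_iter):
        # Weighted squared distance from every sample to every center.
        diff2 = (X[:, None, :] - centers[None, :, :]) ** 2  # n x k x m
        dist = (diff2 * weights[None, :, :]).sum(axis=2)    # n x k
        labels = dist.argmin(axis=1)

        for l in range(k):
            members = X[labels == l]
            if len(members) == 0:
                continue  # keep the previous center and weights
            centers[l] = members.mean(axis=0)
            # Genes with low within-cluster dispersion get larger weights.
            disp = ((members - centers[l]) ** 2).sum(axis=0)
            logits = -disp / gamma
            logits -= logits.max()  # numerical stability
            w = np.exp(logits)
            weights[l] = w / w.sum()

    return labels, centers, weights


# Synthetic example: 40 samples, 6 genes, gene 0 shifted in the first 20 samples.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 6))
X[:20, 0] += 5.0
labels, centers, weights = subspace_weighted_kmeans(X, k=2)
```

In this sketch each row of `weights` describes how much each gene contributes to one sample cluster, which mirrors the role the abstract assigns to the gene subspace weight matrix. The smoothing parameter `gamma` controls how concentrated the weights become and would need tuning on real data.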
