Constrained Co-clustering of Gene Expression Data

In many applications, experts find co-clustering results easier to interpret than those of mono-dimensional clustering. Co-clustering aims at computing a bi-partition, i.e., a collection of co-clusters: each co-cluster is a group of objects associated with a group of attributes, and these associations can support interpretation. Many constrained clustering algorithms have been proposed to exploit domain knowledge and improve partition relevancy in the mono-dimensional case (e.g., using the so-called must-link and cannot-link constraints). Here, we consider constrained co-clustering not only for extended must-link and cannot-link constraints (i.e., both objects and attributes can be involved), but also for interval constraints that enforce properties of co-clusters over ordered domains. We propose an iterative co-clustering algorithm that exploits user-defined constraints while minimizing the sum-squared residue, an objective function introduced for gene expression data clustering by Cho et al. (2004). We illustrate the added value of our approach in two applications that concern gene expression data analysis.
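As a rough illustration of the two ingredients named above, the following sketch computes a sum-squared residue for a given bi-partition and lists violated must-link and cannot-link pairs. It is not the algorithm proposed in the paper: the function names `sum_squared_residue` and `violated_constraints` are hypothetical, the residue used is the form a_ij - (row mean) - (column mean) + (co-cluster mean) taken within each co-cluster, and interval constraints are not covered.

```python
import numpy as np

def sum_squared_residue(A, row_labels, col_labels):
    """Sum over all co-clusters of the squared residues
    a_ij - row_mean_i - col_mean_j + block_mean (assumed residue form)."""
    A = np.asarray(A, dtype=float)
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    total = 0.0
    for r in np.unique(row_labels):
        for c in np.unique(col_labels):
            # Sub-matrix of the co-cluster (row cluster r, column cluster c).
            block = A[np.ix_(row_labels == r, col_labels == c)]
            row_means = block.mean(axis=1, keepdims=True)
            col_means = block.mean(axis=0, keepdims=True)
            residue = block - row_means - col_means + block.mean()
            total += float((residue ** 2).sum())
    return total

def violated_constraints(labels, must_link=(), cannot_link=()):
    """Return the constraint pairs broken by a labeling.
    Works for object labels or attribute labels alike."""
    bad = [(i, j) for i, j in must_link if labels[i] != labels[j]]
    bad += [(i, j) for i, j in cannot_link if labels[i] == labels[j]]
    return bad

# Toy data (made up): a purely additive matrix has zero residue.
additive = [[1, 2], [2, 3]]
crossed = [[1, 2], [2, 1]]
ssr_additive = sum_squared_residue(additive, [0, 0], [0, 0])
ssr_crossed = sum_squared_residue(crossed, [0, 0], [0, 0])
broken = violated_constraints([0, 0, 1],
                              must_link=[(0, 1), (0, 2)],
                              cannot_link=[(0, 1)])
```

A constrained iterative scheme would reassign rows and columns to lower this objective while keeping the list of violated pairs empty; how the paper does this is not specified in the abstract.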

[1]  L. Lazzeroni.  Plaid models for gene expression data , 2000 .

[2]  Yaniv Ziv,et al.  Revealing modular organization in the yeast transcriptional network , 2002, Nature Genetics.

[3]  L. Hubert,et al.  Comparing partitions , 1985 .

[4]  Joseph T. Chang,et al.  Spectral biclustering of microarray data: coclustering genes and conditions , 2003, Genome research.

[5]  Dan Klein,et al.  From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering , 2002, ICML.

[6]  Ian Davidson,et al.  Constrained Clustering: Advances in Algorithms, Theory, and Applications , 2008 .

[7]  S. S. Ravi,et al.  Clustering with Constraints: Feasibility Issues and the k-Means Algorithm , 2005, SDM.

[8]  George M. Church,et al.  Biclustering of Expression Data , 2000, ISMB.

[9]  Inderjit S. Dhillon,et al.  A generalized maximum entropy approach to bregman co-clustering and matrix approximation , 2004, J. Mach. Learn. Res.

[10]  S. S. Ravi,et al.  Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results , 2005, PKDD.

[11]  E. Salmon.  Gene Expression During the Life Cycle of Drosophila melanogaster , 2002 .

[12]  Inderjit S. Dhillon,et al.  Information-theoretic co-clustering , 2003, KDD '03.

[13]  Arlindo L. Oliveira,et al.  Biclustering algorithms for biological data analysis: a survey , 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[14]  Ruggero G. Pensa,et al.  Constraint-driven Co-Clustering of 0/1 Data , 2008 .

[15]  Raymond J. Mooney,et al.  Integrating constraints and metric learning in semi-supervised clustering , 2004, ICML.

[16]  Inderjit S. Dhillon,et al.  Minimum Sum-Squared Residue Co-Clustering of Gene Expression Data , 2004, SDM.

[17]  Céline Robardet,et al.  Efficient Local Search in Conceptual Clustering , 2001, Discovery Science.

[18]  J. Derisi,et al.  The Transcriptome of the Intraerythrocytic Developmental Cycle of Plasmodium falciparum , 2003, PLoS biology.

[19]  Ruggero G. Pensa,et al.  Towards Constrained Co-clustering in Ordered 0/1 Data Sets , 2006, ISMIS.

[20]  Rong Ge,et al.  Constraint-driven clustering , 2007, KDD '07.

[21]  Joydeep Ghosh,et al.  On Scaling Up Balanced Clustering Algorithms , 2002, SDM.

[22]  Joydeep Ghosh,et al.  Scalable Clustering Algorithms with Balancing Constraints , 2006, Data Mining and Knowledge Discovery.

[23]  Raymond J. Mooney,et al.  A probabilistic framework for semi-supervised clustering , 2004, KDD.

[24]  Claire Cardie,et al.  Constrained K-means Clustering with Background Knowledge , 2001, ICML.

[25]  Dan Gusfield.  Introduction to the IEEE/ACM Transactions on Computational Biology and Bioinformatics , 2004 .

[26]  Arindam Banerjee,et al.  Semi-supervised Clustering by Seeding , 2002, ICML.