Co-clustering numerical data under user-defined constraints

In the generic setting of objects × attributes matrix data analysis, co-clustering is a useful unsupervised data mining method. A co-clustering task provides a bi-partition made of co-clusters: each co-cluster is a group of objects associated with a group of attributes, and these associations can support expert interpretation. Many constrained clustering algorithms have been proposed to exploit domain knowledge and to improve partition relevancy in the mono-dimensional clustering case (e.g., using must-link and cannot-link constraints on one of the two dimensions). Here, we consider constrained co-clustering not only for extended must-link and cannot-link constraints (i.e., both objects and attributes can be involved), but also for interval constraints that enforce properties of co-clusters when considering ordered domains. We describe an iterative co-clustering algorithm that exploits user-defined constraints while minimizing a given objective function. Because the setting is generic, different objective functions can be used. The added value of our approach is demonstrated on both synthetic and real data. In particular, several experiments illustrate the practical impact of this co-clustering setting on gene expression data analysis and in an original application to a protein motif discovery problem.
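To make the vocabulary above concrete, the following minimal sketch shows how must-link, cannot-link, and interval constraints might be checked against a candidate partition of one dimension, and how one possible objective could be evaluated for a bi-partition. It is an illustration under stated assumptions, not the algorithm described in the paper: the function names are hypothetical, the interval constraint is read here as "each cluster occupies a contiguous run of an ordered dimension", and the objective is a simple sum of squared deviations from co-cluster means, chosen only as one example of an objective function.

```python
import numpy as np

# Hypothetical helper names; none of them come from the paper.

def satisfies_must_link(labels, pairs):
    """True if every pair (i, j) is assigned to the same cluster."""
    return all(labels[i] == labels[j] for i, j in pairs)

def satisfies_cannot_link(labels, pairs):
    """True if every pair (i, j) is assigned to different clusters."""
    return all(labels[i] != labels[j] for i, j in pairs)

def satisfies_interval(labels):
    """True if each cluster is a contiguous run along the ordered dimension.

    Assumption: positions in `labels` follow the order of the domain.
    """
    seen = set()
    prev = None
    for lab in labels:
        if lab != prev:
            if lab in seen:  # cluster reappears after being left
                return False
            seen.add(lab)
            prev = lab
    return True

def sum_squared_residue(X, row_labels, col_labels):
    """Sum over co-clusters of squared deviations from the co-cluster mean.

    Assumption: this is just one example objective, not necessarily
    the one used by the authors.
    """
    X = np.asarray(X, dtype=float)
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    total = 0.0
    for r in np.unique(row_labels):
        for c in np.unique(col_labels):
            block = X[np.ix_(row_labels == r, col_labels == c)]
            total += ((block - block.mean()) ** 2).sum()
    return total

# Toy matrix: 3 objects x 3 attributes with a clean block structure.
X = [[1, 1, 5],
     [1, 1, 5],
     [9, 9, 0]]
rows, cols = [0, 0, 1], [0, 0, 1]
residue = sum_squared_residue(X, rows, cols)
feasible = (satisfies_must_link(rows, [(0, 1)])
            and satisfies_cannot_link(rows, [(0, 2)])
            and satisfies_interval(cols))
```

An iterative constrained scheme would, in this reading, only accept reassignments of objects or attributes that keep such checks true while lowering the objective; how the paper actually combines the two is not specified in the abstract.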
