Bregman Bubble Co-clustering

Clustering problems often involve datasets in which only part of the data is relevant to the problem. In microarray data analysis, for example, only a subset of the genes shows interesting patterns, and only within a subset of the conditions (features). To identify meaningful clusters accurately on such datasets, non-informative data points should be automatically detected and pruned, and non-discriminative features should be discarded at the same time. Additionally, since clusters can exist in different subspaces of the feature space, a clustering algorithm that can identify clusters along multiple axes is more suitable than one restricted to traditional "one-sided" clustering. We propose Bregman Bubble Co-clustering (BBCC), an approach that generalizes both Bregman bubble clustering [GG06] and Bregman co-clustering [BDG07] into a scalable and versatile framework. BBCC works with a large variety of distance measures and with different co-cluster definitions, making it applicable to a wide range of real-life datasets. We also provide insights into a soft version of our algorithm and the underlying generative model. We further extend BBCC to efficiently detect arbitrarily positioned, possibly overlapping co-clusters in a dataset, and combine it with a novel model selection strategy that automatically determines the appropriate number of co-clusters. We demonstrate the effectiveness of our approach through extensive experiments on synthetic as well as real datasets. We show that BBCC not only performs better than traditional approaches on microarray datasets but, in a completely unsupervised manner, even improves upon human-curated techniques.
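As background for the "large variety of distance measures" mentioned above, the following is a minimal sketch of the standard Bregman divergence, d_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>, for a strictly convex function phi. It is not the BBCC algorithm itself, and the function names (`bregman_divergence`, `sq_norm`, `neg_entropy`) are hypothetical choices made for illustration only. It shows how two familiar measures arise from different choices of phi.

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """Standard Bregman divergence: phi(x) - phi(y) - <x - y, grad_phi(y)>."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(phi(x) - phi(y) - np.dot(x - y, grad_phi(y)))

# phi(v) = ||v||^2 gives the squared Euclidean distance.
def sq_norm(v):
    return np.dot(v, v)

def sq_norm_grad(v):
    return 2.0 * v

# phi(v) = sum v log v (negative entropy, for positive v) gives the
# generalized I-divergence, which equals the KL divergence when both
# arguments are probability vectors.
def neg_entropy(v):
    return np.sum(v * np.log(v))

def neg_entropy_grad(v):
    return np.log(v) + 1.0

x, y = np.array([1.0, 2.0]), np.array([3.0, 5.0])
d_euclid = bregman_divergence(sq_norm, sq_norm_grad, x, y)

p, q = np.array([0.5, 0.5]), np.array([0.25, 0.75])
d_kl = bregman_divergence(neg_entropy, neg_entropy_grad, p, q)
```

Here `d_euclid` matches `np.sum((x - y) ** 2)` and `d_kl` matches `np.sum(p * np.log(p / q))`, so swapping phi is enough to switch between the two measures.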
