A scalable framework for discovering coherent co-clusters in noisy data

Clustering problems often involve datasets where only a part of the data is relevant to the problem, e.g., in microarray data analysis only a subset of the genes show cohesive expressions within a subset of the conditions/features. The existence of a large number of non-informative data points and features makes it challenging to hunt for coherent and meaningful clusters from such datasets. Additionally, since clusters could exist in different subspaces of the feature space, a co-clustering algorithm that simultaneously clusters objects and features is often more suitable as compared to one that is restricted to traditional "one-sided" clustering. We propose Robust Overlapping Co-Clustering (ROCC), a scalable and very versatile framework that addresses the problem of efficiently mining dense, arbitrarily positioned, possibly overlapping co-clusters from large, noisy datasets. ROCC has several desirable properties that make it extremely well suited to a number of real life applications.

[1]  S. Ramaswamy,et al.  Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. , 2002, Cancer research.

[2]  George M. Church,et al.  Biclustering of Expression Data , 2000, ISMB.

[3]  Shailesh V. Date,et al.  A Probabilistic Functional Network of Yeast Genes , 2004, Science.

[4]  J. H. Ward Hierarchical Grouping to Optimize an Objective Function , 1963 .

[5]  Carla E. Brodley,et al.  Feature Selection for Unsupervised Learning , 2004, J. Mach. Learn. Res..

[6]  T. M. Murali,et al.  Extracting Conserved Gene Expression Motifs from Gene Expression Data , 2002, Pacific Symposium on Biocomputing.

[7]  Philip S. Yu,et al.  Clustering by pattern similarity in large data sets , 2002, SIGMOD '02.

[8]  Inderjit S. Dhillon,et al.  Information-theoretic co-clustering , 2003, KDD '03.

[9]  I. Dhillon,et al.  Coclustering of Human Cancer Microarrays Using Minimum Sum-Squared Residue Coclustering , 2008, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[10]  Lothar Thiele,et al.  A systematic comparison and evaluation of biclustering methods for gene expression data , 2006, Bioinform..

[11]  Inderjit S. Dhillon,et al.  Robust Overlapping Co-clustering , 2008 .

[12]  Inderjit S. Dhillon,et al.  A generalized maximum entropy approach to bregman co-clustering and matrix approximation , 2004, J. Mach. Learn. Res..

[13]  Richard M. Karp,et al.  Discovering local structure in gene expression data: the order-preserving submatrix problem , 2002, RECOMB '02.

[14]  Sven Bergmann,et al.  Iterative signature algorithm for the analysis of large-scale gene expression data. , 2002, Physical review. E, Statistical, nonlinear, and soft matter physics.

[15]  Roded Sharan,et al.  Discovering statistically significant biclusters in gene expression data , 2002, ISMB.

[16]  Anil K. Jain,et al.  Simultaneous feature selection and clustering using mixture models , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Arlindo L. Oliveira,et al.  Biclustering algorithms for biological data analysis: a survey , 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[18]  L. Lazzeroni Plaid models for gene expression data , 2000 .

[19]  Luca Benini,et al.  Discovering coherent biclusters from gene expression data using zero-suppressed binary decision diagrams , 2005, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[20]  Joydeep Ghosh,et al.  Bregman Bubble Clustering: A Robust, Scalable Framework for Locating Multiple, Dense Regions in Data , 2006, Sixth International Conference on Data Mining (ICDM'06).

[21]  D. Botstein,et al.  Genomic expression programs in the response of yeast cells to environmental changes. , 2000, Molecular biology of the cell.

[22]  Anthony K. H. Tung,et al.  Mining Shifting-and-Scaling Co-Regulation Patterns on Gene Expression Profiles , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[23]  Huan Liu,et al.  Subspace clustering for high dimensional data: a review , 2004, SKDD.

[24]  Inderjit S. Dhillon,et al.  A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification , 2003, J. Mach. Learn. Res..

[25]  ThieleLothar,et al.  A systematic comparison and evaluation of biclustering methods for gene expression data , 2006 .

[26]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[27]  Wei Wang,et al.  Mining Approximate Order Preserving Clusters in the Presence of Noise , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[28]  Eckart Zitzler,et al.  BicAT: a biclustering analysis toolbox , 2006, Bioinform..