Paper information - A scalable framework for discovering coherent co-clusters in noisy data

Clustering problems often involve datasets where only part of the data is relevant to the problem; e.g., in microarray data analysis, only a subset of the genes show cohesive expression within a subset of the conditions/features. The presence of a large number of non-informative data points and features makes it challenging to find coherent and meaningful clusters in such datasets. Additionally, since clusters can exist in different subspaces of the feature space, a co-clustering algorithm that simultaneously clusters objects and features is often more suitable than one restricted to traditional "one-sided" clustering. We propose Robust Overlapping Co-Clustering (ROCC), a scalable and versatile framework that addresses the problem of efficiently mining dense, arbitrarily positioned, possibly overlapping co-clusters from large, noisy datasets. ROCC has several desirable properties that make it well suited to a number of real-life applications.

Inderjit S. Dhillon | Joydeep Ghosh | Gunjan Gupta | Hyuk Cho | Meghana Deodhar
