Paper information - Sleeved coclustering

Sleeved coclustering

A coCluster of an $m \times n$ matrix $X$ is a submatrix determined by a subset of the rows and a subset of the columns. The problem of finding coClusters with specific properties is of interest, in particular, in the analysis of microarray experiments. In that case the entries of the matrix $X$ are the expression levels of $m$ genes in each of $n$ tissue samples. One goal of the analysis is to extract a subset of the samples and a subset of the genes such that the expression levels of the chosen genes behave similarly across the subset of the samples, presumably reflecting an underlying regulatory mechanism governing the expression level of the genes.

We propose to base the similarity of the genes in a coCluster on a simple biological model, in which the strength of the regulatory mechanism in sample $j$ is $H_j$, and the response strength of gene $i$ to the regulatory mechanism is $G_i$. In other words, every two genes participating in a good coCluster should have expression values in each of the participating samples whose ratio is a constant depending only on the two genes. Noise in the expression levels of genes is taken into account by allowing a deviation from the model, measured by a relative error criterion. The sleeve-width of the coCluster reflects the extent to which entry $(i, j)$ in the coCluster is allowed to deviate, relatively, from being expressed as the product $G_i H_j$.

We present a polynomial-time Monte-Carlo algorithm which outputs a list of coClusters whose sleeve-widths do not exceed a prespecified value. Moreover, we prove that the list includes, with fixed probability, a coCluster which is near-optimal in its dimensions. Extensive experimentation with synthetic data shows that the algorithm performs well.
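The pairwise-ratio property of the product model can be illustrated with a small sketch. This is not the paper's algorithm, and it does not use the paper's exact definition of sleeve-width, which the abstract does not state. The function name `ratio_spread` and the toy data are hypothetical, and all entries are assumed to be strictly positive. For every pair of genes in a candidate coCluster, the sketch measures by what factor their expression ratio varies across the chosen samples; under an exact product model $G_i H_j$ that factor is 1.

```python
from itertools import combinations


def ratio_spread(X, rows, cols):
    """Largest factor by which the ratio of two genes (rows) varies
    across the chosen samples (cols). Hypothetical helper; assumes
    every selected entry of X is strictly positive."""
    worst = 1.0
    for i, k in combinations(rows, 2):
        ratios = [X[i][j] / X[k][j] for j in cols]
        worst = max(worst, max(ratios) / min(ratios))
    return worst


# Toy data following the exact product model X[i][j] = G[i] * H[j].
G = [1.0, 2.0, 4.0]          # response strength of each gene (assumed values)
H = [1.0, 3.0, 0.5, 8.0]     # regulatory strength in each sample (assumed values)
X = [[g * h for h in H] for g in G]

# Perturb one entry so that sample 3 no longer fits the model for gene 2.
X[2][3] *= 1.5

clean = ratio_spread(X, rows=[0, 1, 2], cols=[0, 1, 2])
noisy = ratio_spread(X, rows=[0, 1, 2], cols=[0, 1, 2, 3])
print(clean, noisy)
```

On the unperturbed columns every pairwise ratio is constant, so the spread stays at 1. Including the perturbed sample raises it to about 1.5, the kind of relative deviation that a bound on the sleeve-width is meant to limit.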

Avraham A. Melkman | Eran Shaham
