Finding biclusters by random projections

Given a matrix X composed of symbols, a bicluster is a submatrix of X obtained by removing some of the rows and some of the columns of X in such a way that every remaining row reads the same string. In this paper, we are concerned with the problem of finding the bicluster with the largest area in a large matrix X. We first prove that the problem is NP-complete. We then present a fast randomized algorithm that discovers the largest bicluster by random projections. We give a detailed probabilistic analysis of the algorithm and an asymptotic study of the statistical significance of the solutions, and we report the results of extensive simulations on synthetic data.
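
To make the definition concrete, the following Python sketch checks whether a choice of rows and columns forms a bicluster, and illustrates one plausible reading of a search by random projections. The abstract does not spell out the algorithm's steps, so everything here is an assumption made for illustration: the names `is_bicluster` and `find_bicluster`, the parameters `k` (number of columns sampled per round) and `iterations`, and the strategy of grouping rows by their projection onto the sampled columns. It should not be taken as the authors' exact procedure.

```python
import random
from collections import defaultdict


def is_bicluster(X, rows, cols):
    """Return True if every selected row reads the same string on the selected columns."""
    if not rows or not cols:
        return False
    first = [X[rows[0]][c] for c in cols]
    return all([X[r][c] for c in cols] == first for r in rows)


def find_bicluster(X, k, iterations, seed=0):
    """Hypothetical random-projection search (an illustrative assumption).

    Each round samples k columns, groups the rows that agree on those
    columns, extends each group to every column on which its rows agree,
    and keeps the candidate with the largest area (rows * columns).
    """
    rng = random.Random(seed)
    n_cols = len(X[0])
    best_rows, best_cols = [], []
    for _ in range(iterations):
        sample = rng.sample(range(n_cols), k)
        # Rows with the same projection onto the sampled columns share a group.
        groups = defaultdict(list)
        for r, row in enumerate(X):
            groups[tuple(row[c] for c in sample)].append(r)
        for rows in groups.values():
            # Keep every column on which all rows of the group carry one symbol.
            cols = [c for c in range(n_cols) if len({X[r][c] for r in rows}) == 1]
            if len(rows) * len(cols) > len(best_rows) * len(best_cols):
                best_rows, best_cols = rows, cols
    return best_rows, best_cols


# Toy matrix: rows 0-2 agree on columns 0-2; column 3 is noise.
X = ["xxxa", "xxxb", "xxxc", "defg"]
rows, cols = find_bicluster(X, k=1, iterations=50)
print(rows, cols, is_bicluster(X, rows, cols))
```

Every candidate the sketch returns is a valid bicluster by construction, but nothing guarantees it is the largest one: a round only helps when the sampled columns happen to fall inside the target bicluster, which is why the sampling is repeated many times.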
