Finding Additive Biclusters with Random Background

The biclustering problem has been extensively studied in many areas, including e-commerce, data mining, machine learning, pattern recognition, statistics, and, more recently, computational biology. Given an $n \times m$ matrix $A$ ($n \ge m$), the main goal of biclustering is to identify a subset of rows (called objects) and a subset of columns (called properties) such that an objective function specifying the quality of the resulting bicluster (formed by the chosen rows and columns of $A$) is optimized. The problem has been proved or conjectured to be NP-hard under various mathematical models. In this paper, we study a probabilistic model of the implanted additive bicluster problem, where each element in the $n \times m$ background matrix is a random number from $[0, L-1]$, and a $k \times k$ implanted additive bicluster is obtained from an error-free additive bicluster by randomly changing each element to a number in $[0, L-1]$ with probability $\theta$. We propose an $O(n^2 m)$ time voting algorithm to solve the problem. We show that for any constant $\delta$ such that $(1-\delta)(1-\theta)^2 - \frac{1}{L} > 0$, when $k \ge \max \left\{\frac{8}{\alpha} \sqrt{n\log n},~ \frac{8 \log n}{c} + \log (2L)\right\}$, where $c$ is a constant, the voting algorithm correctly finds the implanted bicluster with probability at least $1 - \frac{9}{n^{2}}$. We also implement our algorithm as a software tool, called VOTE, for finding novel biclusters in microarray gene expression data. The implementation incorporates several nontrivial ideas for estimating the size of an implanted bicluster, adjusting the threshold in voting, dealing with small biclusters, and dealing with multiple (and overlapping) implanted biclusters. Our experimental results on both simulated and real datasets show that VOTE can find biclusters with high accuracy and speed.
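To make the probabilistic model concrete, the following minimal Python sketch generates one instance of it. It is an illustration only, not the VOTE tool or the voting algorithm. The function names `implant_additive_bicluster` and `is_additive` are hypothetical, and the sketch assumes that an error-free additive bicluster has entries equal to a row offset plus a column offset, taken modulo $L$ so that values stay in $[0, L-1]$; the abstract does not spell out this detail.

```python
import random


def implant_additive_bicluster(n, m, k, L, theta, seed=0):
    """Return (A, rows, cols): an n x m matrix over {0, ..., L-1}
    containing a noisy k x k additive bicluster on rows x cols."""
    rng = random.Random(seed)
    # Background: every entry is drawn uniformly from {0, ..., L-1}.
    A = [[rng.randrange(L) for _ in range(m)] for _ in range(n)]
    # Choose which k rows and k columns carry the implanted bicluster.
    rows = sorted(rng.sample(range(n), k))
    cols = sorted(rng.sample(range(m), k))
    # Assumption: error-free entry = (row offset + column offset) mod L.
    row_off = {i: rng.randrange(L) for i in rows}
    col_off = {j: rng.randrange(L) for j in cols}
    for i in rows:
        for j in cols:
            if rng.random() < theta:
                # With probability theta, replace the entry by a random value.
                A[i][j] = rng.randrange(L)
            else:
                A[i][j] = (row_off[i] + col_off[j]) % L
    return A, rows, cols


def is_additive(A, rows, cols, L):
    """True if every selected row differs from the first selected row
    by a single constant (mod L) across all selected columns."""
    base = rows[0]
    for i in rows[1:]:
        diffs = {(A[i][j] - A[base][j]) % L for j in cols}
        if len(diffs) != 1:
            return False
    return True


if __name__ == "__main__":
    A, rows, cols = implant_additive_bicluster(n=20, m=15, k=5, L=10, theta=0.1)
    print("implanted rows:", rows)
    print("implanted cols:", cols)
```

With `theta=0` the implanted submatrix is exactly additive under this assumption, so `is_additive` returns `True`; as `theta` grows, more entries are overwritten and the pattern becomes harder to detect.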
