MIB: Using mutual information for biclustering gene expression data

The result of any clustering or biclustering algorithm depends on the choice of similarity measure. Most biclustering algorithms are based on Euclidean distance or the correlation coefficient. These measures capture only linear relationships between genes, although nonlinear dependencies may also exist among them. In this paper we propose an approach that uses mutual information for biclustering gene expression data. Mutual information is a more general measure of dependence: it captures positive and negative correlation as well as nonlinear relationships. To the best of our knowledge, none of the existing biclustering algorithms has used mutual information as a similarity measure between two genes. We obtained biclusters from gene expression data of Arabidopsis thaliana and compared them with those obtained by two other algorithms, ISA and BIMAX. The biological significance of the biclusters was checked using the GO database. The genes belonging to our biclusters were significantly enriched with GO terms, with better p-values than the genes of the biclusters obtained by the other two algorithms. To investigate the biclusters further, we examined the promoter regions of the genes in each bicluster for common patterns, that is, transcription factor binding sites (TFBS) or motifs. The promoter regions of the genes in most of our biclusters were found to share common motif patterns that are present in the motif database of Arabidopsis thaliana. In addition, the motifs extracted from our biclusters had better E-values than those extracted from the biclusters of the other algorithms. These results reconfirm that using mutual information as a similarity measure produces better biclusters.
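The contrast between correlation and mutual information can be illustrated with a minimal sketch. It is not the estimator used in the paper: the histogram-based estimate, the function name `mutual_information`, the bin count, and the synthetic "expression profiles" are all assumptions chosen only for illustration. The sketch builds a pair of profiles with a purely nonlinear (quadratic) dependence, for which the Pearson correlation is close to zero while the estimated mutual information is clearly larger than for an independent pair.

```python
import numpy as np


def mutual_information(x, y, bins=8):
    """Histogram-based estimate of mutual information in bits.

    Illustrative only: the bin count is an assumed parameter, and
    histogram estimators are biased for small sample sizes.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # joint distribution
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y, shape (1, bins)
    nz = pxy > 0                           # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))


# Synthetic profiles (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 2000)
y_nonlinear = x**2 + rng.normal(0.0, 0.05, 2000)   # depends on x, but not linearly
y_independent = rng.uniform(-1.0, 1.0, 2000)       # unrelated to x

r = np.corrcoef(x, y_nonlinear)[0, 1]
mi_nonlinear = mutual_information(x, y_nonlinear)
mi_independent = mutual_information(x, y_independent)

print(f"Pearson r (nonlinear pair): {r:.3f}")
print(f"MI (nonlinear pair):        {mi_nonlinear:.3f} bits")
print(f"MI (independent pair):      {mi_independent:.3f} bits")
```

In a pairwise similarity setting, an estimate like this would replace the correlation coefficient between two gene profiles; the choice of binning (or of a different estimator altogether) affects the values and would need to be tuned for real data.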
