Biclustering Methods: Biological Relevance and Application in Gene Expression Analysis

DNA microarray technologies are used extensively to profile the expression levels of thousands of genes under various conditions, yielding extremely large data matrices. Analyzing this information and extracting biologically relevant knowledge is therefore a considerable challenge. A classical approach is clustering (also known as one-way clustering), in which genes (or samples) are grouped by the similarity of their expression profiles across all samples (or, respectively, all genes). An alternative approach is biclustering, which identifies local patterns in the data: subgroups of genes that are co-expressed across only a subset of samples and that may have important biological or medical implications. In this study we evaluate 13 biclustering methods and 2 clustering methods (k-means and hierarchical clustering) and compare their performance on two real gene expression data sets. We apply four evaluation measures: (1) how well the considered (bi)clustering methods differentiate the various sample types; (2) how well the groups of genes discovered by the (bi)clustering methods are annotated with similar Gene Ontology categories; (3) how well the methods differentiate genes that are known to be specific to the particular sample types we study; and (4) the running time of the algorithms. We conclude that, as long as the samples are well defined and annotated, sample contamination is limited, and the samples are well replicated, biclustering methods such as Plaid and SAMBA are useful for discovering relevant subsets of genes and samples.
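The difference between one-way clustering and biclustering can be illustrated with a small toy example. The sketch below is not taken from the study: the expression values, the planted bicluster, and the helper names (`pearson`, `mean_pairwise_correlation`) are all hypothetical, and mean pairwise Pearson correlation is used only as a simple stand-in for "co-expression". It shows three genes whose profiles agree perfectly on a subset of samples but look nearly unrelated when compared across all samples, which is the kind of local pattern a one-way clustering of full profiles can miss.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_pairwise_correlation(matrix, genes, samples):
    """Average Pearson correlation over all gene pairs, restricted to `samples`."""
    pairs = list(combinations(genes, 2))
    total = sum(
        pearson([matrix[g][s] for s in samples], [matrix[h][s] for s in samples])
        for g, h in pairs
    )
    return total / len(pairs)


# Hypothetical expression matrix: rows are genes, columns are samples.
# Genes 0-2 follow the same pattern (up to a constant shift) on samples 0-2
# and behave independently on samples 3-5. Gene 3 is outside the bicluster.
EXPRESSION = [
    [1.0, 5.0, 3.0, 2.0, 7.0, 4.0],
    [2.0, 6.0, 4.0, 7.0, 1.0, 3.0],
    [0.5, 4.5, 2.5, 4.0, 3.0, 8.0],
    [6.0, 2.0, 5.0, 1.0, 4.0, 3.0],
]
BICLUSTER_GENES = [0, 1, 2]
BICLUSTER_SAMPLES = [0, 1, 2]
ALL_SAMPLES = list(range(len(EXPRESSION[0])))

# One-way view: compare the genes across every sample.
global_score = mean_pairwise_correlation(EXPRESSION, BICLUSTER_GENES, ALL_SAMPLES)
# Biclustering view: compare the same genes on the sample subset only.
local_score = mean_pairwise_correlation(EXPRESSION, BICLUSTER_GENES, BICLUSTER_SAMPLES)

print(f"mean correlation across all samples:   {global_score:.2f}")
print(f"mean correlation within sample subset: {local_score:.2f}")
```

In this toy setting the gene and sample subsets are given in advance; an actual biclustering method has to search for them, and each of the evaluated algorithms does so with its own model and scoring criterion.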
