Improved biclustering of microarray data demonstrated through systematic performance tests

A new algorithm is presented for fitting the plaid model, a biclustering method developed for clustering gene expression data. The approach is based on speedy individual differences clustering and uses binary least squares to update the cluster membership parameters, making use of the binary constraints on these parameters and simplifying the other parameter updates. The performance of the proposed algorithm and of the original plaid model algorithm is tested on simulated data sets designed to imitate (normalised) gene expression data, covering a range of biclustering configurations. Empirical distributions for the components of these data sets, including non-systematic error, are derived from a real set of microarray data. A set of two-way quality measures is proposed, based on one-way measures commonly used in information retrieval, to evaluate the quality of a retrieved bicluster with respect to a target bicluster in terms of both genes and samples. By defining a one-to-one correspondence between target biclusters and retrieved biclusters, the performance of each algorithm can be assessed. The results show that, using appropriately selected starting criteria, the proposed algorithm outperforms the original plaid model algorithm across a range of data sets. Furthermore, the rigorous assessment of the plaid model establishes a benchmark for the future evaluation of biclustering methods.
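
The two-way quality measures and the one-to-one correspondence described above can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so everything here is an assumption: a bicluster is treated as the set of (gene, sample) cells it covers, precision and recall from information retrieval are computed over those cells, and targets are paired with retrieved biclusters by a brute-force search that maximises the total F-score. The function names `two_way_scores` and `match_biclusters` are hypothetical.

```python
from itertools import permutations


def two_way_scores(target, retrieved):
    """Return (precision, recall, F) for one retrieved bicluster.

    Each bicluster is a (genes, samples) pair of sets. Assumption: the
    two-way measure counts shared cells, i.e. shared genes x shared samples.
    """
    t_genes, t_samples = target
    r_genes, r_samples = retrieved
    overlap = len(t_genes & r_genes) * len(t_samples & r_samples)
    t_size = len(t_genes) * len(t_samples)
    r_size = len(r_genes) * len(r_samples)
    precision = overlap / r_size if r_size else 0.0
    recall = overlap / t_size if t_size else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def match_biclusters(targets, retrieved):
    """Pair targets with retrieved biclusters one-to-one.

    Brute force over assignments, maximising the summed F-score.
    Only practical for a handful of biclusters (assumed here).
    Returns (list of (target_index, retrieved_index), total F-score).
    """
    n, m = len(targets), len(retrieved)
    f = [[two_way_scores(t, r)[2] for r in retrieved] for t in targets]
    best_pairs, best_total = [], -1.0
    if n <= m:
        # Every target receives a distinct retrieved bicluster.
        candidates = (list(enumerate(p)) for p in permutations(range(m), n))
    else:
        # Fewer retrieved than targets: every retrieved one receives a target.
        candidates = ([(i, j) for j, i in enumerate(p)]
                      for p in permutations(range(n), m))
    for pairs in candidates:
        total = sum(f[i][j] for i, j in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    return best_pairs, best_total


if __name__ == "__main__":
    target = ({1, 2, 3, 4}, {1, 2})
    found = ({3, 4, 5, 6}, {2, 3})
    print(two_way_scores(target, found))
```

With more than a few biclusters, the exhaustive search would need to be replaced by a proper assignment solver; the brute-force version is used here only to keep the sketch dependency-free.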
