Biclustering in gene expression data by tendency

The advent of DNA microarray technologies has revolutionized the experimental study of gene expression. Clustering is the most popular approach to analyzing gene expression data and has proven successful in many applications. Our work focuses on discovering subsets of genes that exhibit similar expression patterns along a subset of conditions in the gene expression matrix. Specifically, we look for order-preserving clusters (OP-clusters), in each of which a subset of genes induces a similar linear ordering along a subset of conditions. The pioneering OPSM model, which enforces a strict order shared by the genes in a cluster, is included in our model as a special case. Our model is more robust than OPSM because similarly expressed conditions are allowed to form order-equivalent groups, and no restriction is placed on the order within a group. Guided by our model, we design and implement a deterministic algorithm, OPC-tree, to discover OP-clusters. An experimental study on two real datasets demonstrates the effectiveness of the algorithm in tissue classification and cell cycle identification. In addition, a large percentage of OP-clusters exhibit significant enrichment of one or more functional categories, which implies that OP-clusters carry significant biological relevance.
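
The cluster model described above can be illustrated with a small membership check. This is a minimal sketch and not the OPC-tree algorithm: it only tests whether a gene's expression row is consistent with a given ordered sequence of condition groups. The function names `supports` and `supporting_genes`, the toy matrix, and the absolute grouping threshold `delta` are assumptions made for illustration; the abstract does not specify how "similarly expressed" conditions are grouped.

```python
from typing import List, Sequence


def supports(row: Sequence[float], groups: Sequence[Sequence[int]], delta: float) -> bool:
    """Return True if `row` respects the ordered condition groups.

    Assumed rule (hypothetical, for illustration only):
      * values inside one group differ by at most `delta` (order-equivalent), and
      * every value in a later group is strictly greater than every value
        in the preceding group.
    """
    prev_max = float("-inf")
    for group in groups:
        values = [row[c] for c in group]
        lo, hi = min(values), max(values)
        if hi - lo > delta:   # conditions in the group are not similar enough
            return False
        if lo <= prev_max:    # ordering between consecutive groups is violated
            return False
        prev_max = hi
    return True


def supporting_genes(matrix: Sequence[Sequence[float]],
                     groups: Sequence[Sequence[int]],
                     delta: float) -> List[int]:
    """Indices of genes (rows) whose expression follows the ordered groups."""
    return [g for g, row in enumerate(matrix) if supports(row, groups, delta)]


# Toy expression matrix: rows are genes, columns are conditions c0..c3.
matrix = [
    [1.0, 9.0, 4.0, 4.3],
    [2.0, 8.0, 5.2, 5.0],
    [7.0, 1.0, 3.0, 6.0],
]

# Grouped ordering: c0 < {c2, c3} < c1, with c2 and c3 treated as equivalent.
grouped = supporting_genes(matrix, [[0], [2, 3], [1]], delta=0.5)

# Strict ordering (every group is a singleton): c0 < c2 < c3 < c1.
strict = supporting_genes(matrix, [[0], [2], [3], [1]], delta=0.0)

print(grouped, strict)
```

In this toy data the second gene follows the grouped ordering but not the strict one, because its values under c2 and c3 are close yet swapped. This is the sense in which a strict, OPSM-style ordering is the special case where every group holds a single condition.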
