Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization

Background
DNA microarray technology allows the expression levels of thousands of genes to be measured under tens or hundreds of different conditions. In microarray data, genes with similar functions usually co-express under certain conditions only [1]. Biclustering, which clusters genes and conditions simultaneously, is therefore preferred over traditional clustering for discovering these coherent genes. Various biclustering algorithms have been developed using different bicluster formulations. Unfortunately, many useful formulations result in NP-complete problems. In this article, we investigate an efficient method for identifying a popular type of bicluster, the additive model. Furthermore, parallel coordinate (PC) plots are used for bicluster visualization and analysis.

Results
We develop a novel and efficient biclustering algorithm that can be regarded as a greedy version of an existing algorithm, the pCluster algorithm. By relaxing the homogeneity constraint, the proposed algorithm has polynomial-time complexity in the worst case, whereas the pCluster algorithm has exponential-time complexity. Experiments on artificial datasets verify that our algorithm can identify both additive-related and multiplicative-related biclusters in the presence of overlap and noise. Biologically significant biclusters have been validated on the yeast cell-cycle expression dataset using Gene Ontology annotations. A comparative study shows that the proposed approach outperforms several existing biclustering algorithms. We also provide an interactive exploratory tool, based on PC plot visualization, for determining the parameters of our biclustering algorithm.

Conclusion
We have proposed a novel biclustering algorithm that works with PC plots for interactive exploratory analysis of gene expression data. Experiments show that the biclustering algorithm is efficient and is capable of detecting co-regulated genes. The interactive analysis allows the parameters of the biclustering algorithm to be tuned so as to achieve the best result. In the future, we will modify the proposed algorithm for other bicluster models such as the coherent evolution model.
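To make the additive model concrete, the following is a minimal sketch in Python of the homogeneity test used in the pCluster formulation [4]: a submatrix is accepted when every 2x2 block has a pScore of at most a threshold delta. It is an illustration only, not the greedy algorithm proposed in this article. The function names `max_pscore` and `is_additive_bicluster`, the threshold values, and the toy matrices are assumptions made for the example.

```python
import numpy as np


def max_pscore(sub):
    """Largest pScore over all 2x2 blocks of `sub` (rows: genes, columns: conditions).

    For rows i, j and columns x, y the pScore is
    |(a_ix - a_iy) - (a_jx - a_jy)|, which equals |d_x - d_y| with d = row_i - row_j.
    Its maximum over column pairs is therefore max(d) - min(d).
    """
    sub = np.asarray(sub, dtype=float)
    worst = 0.0
    for i in range(sub.shape[0]):
        for j in range(i + 1, sub.shape[0]):
            d = sub[i] - sub[j]
            worst = max(worst, float(d.max() - d.min()))
    return worst


def is_additive_bicluster(sub, delta):
    """True when every pair of rows differs by a near-constant offset (within delta)."""
    return max_pscore(sub) <= delta


# Toy data (hypothetical): each row is the base profile plus a row-specific offset.
base = np.array([1.0, 3.0, 2.0, 5.0])
additive = base + np.array([0.0, 2.0, -1.0])[:, None]

# Same matrix with one perturbed entry, to mimic noise.
noisy = additive.copy()
noisy[1, 2] += 0.5

# Multiplicative pattern: each row is the base profile times a row-specific factor.
# Taking logarithms of positive values turns it into an additive pattern.
multiplicative = base * np.array([1.0, 2.0, 0.5])[:, None]

print(is_additive_bicluster(additive, delta=0.0))
print(is_additive_bicluster(noisy, delta=0.25))
print(is_additive_bicluster(np.log(multiplicative), delta=1e-9))
```

In a PC plot, the rows of an additive bicluster appear as vertically shifted copies of one polyline across the selected conditions, which is why the same offset-based test and the visual inspection agree.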

[1]  G. Church,et al.  Systematic determination of genetic network architecture , 1999, Nature Genetics.

[2]  T. M. Murali,et al.  Extracting Conserved Gene Expression Motifs from Gene Expression Data , 2002, Pacific Symposium on Biocomputing.

[3]  Hong Yan,et al.  Cluster analysis of gene expression data based on self-splitting and merging competitive learning , 2004, IEEE Transactions on Information Technology in Biomedicine.

[4]  Philip S. Yu,et al.  Clustering by pattern similarity in large data sets , 2002, SIGMOD '02.

[5]  Lothar Thiele,et al.  A systematic comparison and evaluation of biclustering methods for gene expression data , 2006, Bioinform..

[6]  Jessie B. Kennedy,et al.  Coordinated graph and scatter-plot views for the visual exploration of microarray time-series data , 2003, IEEE Symposium on Information Visualization 2003.

[7]  Matthew O. Ward,et al.  Interactive hierarchical displays: a general framework for visualization and exploration of large multivariate data sets , 2003, Comput. Graph..

[8]  Richard M. Karp,et al.  Discovering local structure in gene expression data: the order-preserving submatrix problem , 2002, RECOMB '02.

[9]  Lai-Wan Chan,et al.  Biclustering Gene Expression Profiles by Alternately Sorting with Weighted Correlated Coefficient , 2006, 2006 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing.

[10]  Arlindo L. Oliveira,et al.  Biclustering algorithms for biological data analysis: a survey , 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[11]  Yaniv Ziv,et al.  Revealing modular organization in the yeast transcriptional network , 2002, Nature Genetics.

[12]  Ben Shneiderman,et al.  Dynamic querying for pattern identification in microarray and genomic data , 2003, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings.

[13]  Alfred Inselberg,et al.  Parallel coordinates for visualizing multi-dimensional geometry , 1987 .

[14]  Ronald W. Davis,et al.  Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray , 1995, Science.

[15]  M. Cooper,et al.  Visual data analysis using tracked statistical measures within parallel coordinate representations , 2005, Coordinated and Multiple Views in Exploratory Visualization (CMV'05).

[16]  Hong Yan,et al.  A new geometric biclustering algorithm based on the Hough transform for analysis of large-scale microarray data. , 2008, Journal of theoretical biology.

[17]  Luca Benini,et al.  Discovering coherent biclusters from gene expression data using zero-suppressed binary decision diagrams , 2005, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[18]  Roded Sharan,et al.  Center CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis , 2000, ISMB.

[19]  Luca Benini,et al.  Enhanced pClustering and its applications to gene expression data , 2004, Proceedings. Fourth IEEE Symposium on Bioinformatics and Bioengineering.

[20]  Hong Yan,et al.  Gene expression data clustering and visualization based on a binary hierarchical clustering framework , 2003 .

[21]  Hong Yan,et al.  Discovering biclusters in gene expression data based on high-dimensional linear geometries , 2008, BMC Bioinformatics.

[22]  E. Wegman Hyperdimensional Data Analysis Using Parallel Coordinates , 1990 .

[23]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[24]  E. Winzeler,et al.  Genomics, gene expression and DNA arrays , 2000, Nature.

[25]  Sven Bergmann,et al.  Defining transcription modules using large-scale gene expression data , 2004, Bioinform..

[26]  Alfred Inselberg,et al.  Parallel coordinates: a tool for visualizing multi-dimensional geometry , 1990, Proceedings of the First IEEE Conference on Visualization: Visualization '90.

[27]  Matthew O. Ward,et al.  Clutter Reduction in Multi-Dimensional Data Visualization Using Dimension Reordering , 2004, IEEE Symposium on Information Visualization.

[28]  George M. Church,et al.  Biclustering of Expression Data , 2000, ISMB.

[29]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[30]  Tangirala Venkateswara Prasad,et al.  Visualization of microarray gene expression data , 2006, Bioinformation.

[31]  Jeffrey T. Chang,et al.  Basic microarray analysis: grouping and feature reduction. , 2001, Trends in biotechnology.

[32]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[33]  Daniel L. Hartl,et al.  GeneMerge - Post-genomic Analysis, Data Mining, and Hypothesis Testing , 2003, Bioinform..

[34]  Roded Sharan,et al.  CLICK: A Clustering Algorithm for Gene Expression Analysis , 2000, ISMB 2000.

[35]  Wan-Chi Siu,et al.  BiVisu: software tool for bicluster detection and visualization , 2007, Bioinform..