Methods for Cluster Analysis and Validation in Microarray Gene Expression Data

Motivation. Unsupervised learning, or clustering, is frequently used to explore gene expression profiles for insight into both regulation and function. However, the quality of clustering results is often difficult to assess, and each algorithm has tunable parameters, often with no obvious way to choose appropriate values. Most algorithms also require the number of clusters to be specified in advance, yet this value is rarely known and is therefore usually chosen by subjective criteria. Here we present a method that addresses these challenges systematically using statistical evaluation.

Method. The method compares the quality of clustering results in order to choose, by objective criteria, the most appropriate algorithm, distance metric and number of clusters for gene network discovery. In brief, two quality assessment metrics are used: the Consensus Share (CS) and the Feature Configuration Statistic (FCS). CS is the percentage of genes (not gene pairs) that are identically clustered in several clusterings, and FCS is a measure of the randomness of the observed configuration of transcription factor binding sites among clustered genes.

Results. We evaluate this method using both artificial and yeast microarray data. By choosing parameter settings that minimize FCS values and maximize CS values, we show major advantages over other clustering methods, in particular for identifying combinatorially regulated groups of genes. The results show remarkable enrichment for cis-regulatory elements in clusters of genes known to be regulated by such elements, as well as evidence of extensive combinatorial regulation. Moreover, the method can be generalized to cases where prior information about cis-regulatory sites is absent or where it is desirable to calculate FCS values based on functional categorization.
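
As an illustration of the Consensus Share defined in the Method, the following is a minimal Python sketch, not the authors' implementation. The abstract does not say how cluster labels are matched between clusterings, so the sketch assumes that each clustering is aligned to the first one by the label permutation that maximizes agreement (brute force, practical only for a small number of clusters) and that a gene counts as "identically clustered" when its aligned label is the same in every clustering. The function names `align_labels` and `consensus_share` are hypothetical.

```python
from itertools import permutations


def align_labels(reference, labels):
    """Relabel `labels` to agree with `reference` as much as possible.

    Assumption: labels are matched by brute force over permutations,
    which is only practical for a small number of clusters.
    """
    ref_ids = sorted(set(reference))
    ids = sorted(set(labels))
    # Pad the targets so every cluster in `labels` can map somewhere,
    # even when `labels` has more clusters than `reference`.
    extra = len(ids) - len(ref_ids)
    targets = ref_ids + [("unmatched", i) for i in range(extra)]
    best_map, best_hits = None, -1
    for perm in permutations(targets, len(ids)):
        mapping = dict(zip(ids, perm))
        hits = sum(mapping[l] == r for l, r in zip(labels, reference))
        if hits > best_hits:
            best_map, best_hits = mapping, hits
    return [best_map[l] for l in labels]


def consensus_share(clusterings):
    """Percentage of genes given the same aligned label in every clustering."""
    reference = list(clusterings[0])
    aligned = [reference]
    aligned += [align_labels(reference, list(c)) for c in clusterings[1:]]
    # A gene agrees when its column of aligned labels has one distinct value.
    agree = sum(len(set(column)) == 1 for column in zip(*aligned))
    return 100.0 * agree / len(reference)


# Toy example: six genes, three clusterings (integer cluster labels).
a = [0, 0, 0, 1, 1, 1]
b = [1, 1, 1, 0, 0, 0]  # same partition as `a`, labels swapped
c = [0, 0, 1, 1, 1, 1]  # third gene moved to the other cluster
share = consensus_share([a, b, c])
```

In this toy case five of the six genes keep the same aligned label in all three clusterings, so `share` is 5/6 expressed as a percentage. Under the selection rule stated in the Results, a parameter setting with a higher CS (and a lower FCS) would be preferred.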
