Genome-scale cluster analysis of replicated microarrays using shrinkage correlation coefficient

Background

Clustering with some form of correlation coefficient as the gene similarity metric has become a popular method for profiling genomic data. The Pearson correlation coefficient and the standard deviation (SD)-weighted correlation coefficient are the two most widely used similarity metrics for clustering microarray data. However, these two correlations are not optimal for analyzing the replicated microarray data generated by most laboratories. An effective correlation coefficient is needed to provide statistically sufficient analysis of replicated microarray data.

Results

In this study, we describe a novel correlation coefficient, the shrinkage correlation coefficient (SCC), that fully exploits the similarity between replicated microarray experimental samples. The methodology considers both the number of replicates and the variance within each experimental group when clustering expression data, and provides a robust statistical estimate of the error of replicated microarray data. The value of SCC is shown by comparing it with the two correlation coefficients that are currently the most widely used (the Pearson correlation coefficient and the SD-weighted correlation coefficient), using statistical measures on both synthetic expression data and real gene expression data from Saccharomyces cerevisiae. Two leading clustering methods, hierarchical and k-means clustering, were applied for the comparison. The comparison indicated that using SCC achieves better clustering performance. Applying SCC-based hierarchical clustering to replicated microarray data obtained from germinating spores of the fern Ceratopteris richardii, we discovered two clusters of genes with shared expression patterns during spore germination. Functional analysis suggested that some of the genetic mechanisms that control germination in plant lineages as diverse as mosses and angiosperms are also conserved among ferns.

Conclusion

This study shows that SCC is an alternative to the Pearson correlation coefficient and the SD-weighted correlation coefficient, and is particularly useful for clustering replicated microarray data. This computational approach should also be generally useful for proteomic data and data from other high-throughput analysis methodologies.
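The abstract names the two baseline metrics but does not give their formulas, and it does not define SCC itself. As a rough illustration of the baselines only, the sketch below computes a Pearson correlation on replicate-averaged profiles and one common formulation of an SD-weighted correlation, in which each condition is scaled by its replicate standard deviation. The function names, the particular SD-weighted formula, and the toy data are assumptions made for illustration; they are not taken from this paper, and SCC is deliberately not implemented.

```python
import numpy as np


def summarize(replicates):
    """Per-condition mean and sample SD of a (conditions x replicates) array."""
    r = np.asarray(replicates, dtype=float)
    return r.mean(axis=1), r.std(axis=1, ddof=1)


def pearson(x, y):
    """Plain Pearson correlation between two expression profiles."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))


def sd_weighted(x, y, sx, sy):
    """One common SD-weighted correlation (an assumed formulation).

    Each condition is centred on the inverse-variance-weighted mean and
    divided by its SD, so noisy conditions contribute less.
    """
    x, y, sx, sy = (np.asarray(a, dtype=float) for a in (x, y, sx, sy))
    xw = np.sum(x / sx**2) / np.sum(1.0 / sx**2)
    yw = np.sum(y / sy**2) / np.sum(1.0 / sy**2)
    zx = (x - xw) / sx
    zy = (y - yw) / sy
    return float(np.sum(zx * zy) / np.sqrt(np.sum(zx**2) * np.sum(zy**2)))


# Hypothetical data: two genes, four conditions, three replicates each.
gene_a = [[1.0, 1.2, 0.9], [2.1, 1.8, 2.0], [3.2, 2.9, 3.1], [4.0, 4.3, 3.8]]
gene_b = [[0.5, 0.7, 0.4], [1.1, 2.5, 0.2], [1.6, 1.5, 1.7], [2.2, 2.0, 2.1]]

mean_a, sd_a = summarize(gene_a)
mean_b, sd_b = summarize(gene_b)

r_pearson = pearson(mean_a, mean_b)
r_weighted = sd_weighted(mean_a, mean_b, sd_a, sd_b)
print(f"Pearson: {r_pearson:.3f}  SD-weighted: {r_weighted:.3f}")
```

Neither baseline in this sketch uses the number of replicates directly, which is one of the two quantities the abstract says SCC takes into account.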
