Identification and correction of spurious spatial correlations in microarray data.

Microarray experiments are providing a huge amount of genome-wide data on gene expression. Many prior expression analyses have focused on inferring functional relationships (1–7); however, the quality control and normalization of the raw data that result from microarrays have received less attention. Here we address a systematic error that arises from microarrays and discuss current methods to resolve the problem.

It is well known that the data from high-throughput experiments embody a significant component of measurement error that must be removed before any analysis can be applied to the data. An intuitive idea is to repeat the experiments and decrease the noise by averaging the measurements from replicates (8). Unfortunately, microarrays are still difficult to repeat; in most cases, researchers do not have many replicates for analysis. A Bayesian probabilistic approach has been proposed to address the problem of the small number of replicates in microarray experiments (9). While random error can be canceled by replicate experiments, systematic error will not diminish by averaging replicates. For example, a notorious systematic error in microarray experiments is that the expression ratio of a particular gene under different conditions is a function of its absolute expression level. If one uses a simple fold-change cutoff, genes with low expression levels tend to numerically meet the given cutoff, even though they are not truly differentially expressed. Different methods have been proposed to deal with this problem (10–15).

In this review, we want to direct attention toward a type of systematic error that is manifested by the strong interaction between neighboring spots on the array. If the replicate experiments are performed on arrays with the same chip geometry, then these interactions will not be canceled by the replicates. We will first demonstrate this noise via a case study, and then we will discuss the possible source of these artifacts. Finally, we will discuss current methods to solve the problem; in particular, a local averaging approach called standardization and normalization of microarray data (SNOMAD) (16). We examined several different yeast microarray data sets: diauxic shift, α-factor-arrested cell cycle, cdc15-arrested cell cycle, and cdc28-arrested cell cycle (17–19).

To demonstrate the artifact in the microarray data, we offer the following evidence. The relationship between gene expression and physical chip distance can be revealed by comparing the chip distance map (Figure 1A) to an expression correlation coefficient map (Figure 1B). The horizontal and vertical axes of these two maps represent the positions of the genes along a chromosome. The colors on the distance and correlation maps represent the chip distance and expression correlation coefficient between gene pairs, respectively. Interestingly, the highly correlated gene expression regions (Figure 1B, red blocks) always correspond to the short chip distance regions (Figure 1A, red blocks), which suggests that the major reason why two genes are detected to be co-expressed is that these genes are located near each other on the chip.

We also calculated the average correlation coefficient of gene expression profiles as a function of the physical chip distance between two genes. Figure 2 shows the result for a microarray data set of the yeast α-factor-arrested cell cycle. Without an artifact, the average correlation coefficient should be independent of the chip distance. However, Figure 2 shows that the closer two genes are on the chip, the higher their average correlation coefficient is. This indicates that this data set contains a large proportion of artifacts. Actually,
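
The distance-binned correlation analysis described for Figure 2 can be illustrated with a minimal sketch. It assumes the expression profiles are available as a genes × conditions matrix and each gene's spot position as (row, column) chip coordinates. The function name, the argument layout, and the choice of Euclidean distance with Pearson correlation are illustrative assumptions, not the procedure reported for the original data sets.

```python
import numpy as np


def mean_correlation_by_chip_distance(expr, coords, bin_edges):
    """Average pairwise Pearson correlation of expression profiles,
    grouped by the physical chip distance between the two spots.

    expr      : (n_genes, n_conditions) expression matrix (assumed layout)
    coords    : (n_genes, 2) spot positions on the chip (assumed layout)
    bin_edges : increasing distance bin edges; bin b covers
                [bin_edges[b], bin_edges[b + 1])
    Returns one mean correlation per bin (NaN for empty bins).
    """
    expr = np.asarray(expr, dtype=float)
    coords = np.asarray(coords, dtype=float)

    # Pearson correlation between every pair of gene profiles.
    corr = np.corrcoef(expr)

    # Euclidean distance between every pair of spots.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Keep each unordered gene pair once (upper triangle, no diagonal).
    upper = np.triu_indices(expr.shape[0], k=1)
    pair_corr = corr[upper]
    pair_dist = dist[upper]

    # Assign each pair to a distance bin; out-of-range pairs are ignored.
    bin_idx = np.digitize(pair_dist, bin_edges) - 1
    n_bins = len(bin_edges) - 1
    means = np.full(n_bins, np.nan)
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            means[b] = pair_corr[in_bin].mean()
    return means


if __name__ == "__main__":
    # Synthetic 10 x 10 chip: independent noise plus a per-condition
    # spatial gradient that mimics a position-dependent artifact.
    rng = np.random.default_rng(0)
    rows, cols = np.meshgrid(np.arange(10), np.arange(10), indexing="ij")
    coords = np.column_stack([rows.ravel(), cols.ravel()])
    centered = coords - coords.mean(axis=0)
    gradient = rng.normal(size=(2, 20))
    expr = rng.normal(size=(100, 20)) + centered @ gradient
    print(mean_correlation_by_chip_distance(expr, coords, [0, 2, 4, 8, 13]))
```

For artifact-free data, the returned means should be roughly flat across bins; a clear decline from the shortest-distance bin to the longest is the signature discussed above.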

[1] Michael Ruogu Zhang, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization, 1998, Molecular biology of the cell.

[2] Ken W. Y. Cho, et al. Microarray optimizations: increasing spot accuracy and automated identification of true microarray signals, 2002, Nucleic acids research.

[3] Ron Shamir, et al. Clustering Gene Expression Patterns, 1999, J. Comput. Biol.

[4] J. Michael Cherry, et al. Microarray data quality analysis: lessons from the AFGC project, 2004, Plant Molecular Biology.

[5] P. Brown, et al. Exploring the metabolic and genetic control of gene expression on a genomic scale, 1997, Science.

[6] Jonathan Pevsner, et al. Local mean normalization of microarray element signal intensities across an array surface: quality control and correction of spatially systematic artifacts, 2002, BioTechniques.

[7] D. Botstein, et al. Singular value decomposition for genome-wide expression data processing and modeling, 2000, Proceedings of the National Academy of Sciences of the United States of America.

[8] J. Mesirov, et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation, 1999, Proceedings of the National Academy of Sciences of the United States of America.

[9] S. Dudoit, et al. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments, 2002.

[10] G. Church, et al. A computational analysis of whole-genome expression data reveals chromosomal domains of gene expression, 2000, Nature Genetics.

[11] D. Botstein, et al. Cluster analysis and display of genome-wide expression patterns, 1998, Proceedings of the National Academy of Sciences of the United States of America.

[12] D. Botstein, et al. Two yeast forkhead genes regulate the cell cycle and pseudohyphal growth, 2000, Nature.

[13] Pierre Baldi, et al. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes, 2001, Bioinform.

[14] M. Oh, et al. Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects, 2001, Nucleic acids research.

[15] G. Church, et al. Systematic determination of genetic network architecture, 1999, Nature Genetics.

[16] Andreas Rytz, et al. The limit fold change model: A practical approach for selecting differentially expressed genes from microarray data, 2002, BMC Bioinformatics.

[17] Laurie J. Heyer, et al. Exploring expression data: identification and analysis of coexpressed genes, 1999, Genome research.

[18] P. Törönen, et al. Analysis of gene expression data using self-organizing maps, 1999, FEBS letters.

[19] M. R. Fielden, et al. GP3: GenePix post-processing program for automated analysis of raw microarray data, 2002, Bioinform.

[20] S. Dudoit, et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation, 2002, Nucleic acids research.