Quality Control and Robust Estimation for cDNA Microarrays With Replicates

We consider robust estimation of gene intensities from cDNA microarray data with replicates. Several statistical methods for estimating gene intensities from microarrays have been proposed, but little work has been done on robust estimation. This is particularly relevant for experiments with replicates, because even one outlying replicate can have a disastrous effect on the estimated intensity for the gene concerned. Because of the many steps involved in the experimental process from hybridization to image analysis, cDNA microarray data often contain outliers. For example, an outlying data value could occur because of scratches or dust on the surface, imperfections in the glass, or imperfections in the array production. We develop a Bayesian hierarchical model for robust estimation of cDNA microarray intensities. Outliers are modeled explicitly using a t-distribution, and our model also addresses such classical issues as design effects, normalization, transformation, and nonconstant variance. Parameter estimation is carried out using Markov chain Monte Carlo. By identifying potential outliers, the method provides automatic quality control of replicate, array, and gene measurements. The method is applied to three publicly available gene expression datasets and compared with three other methods: ANOVA-normalized log ratios, the median log ratio, and estimation after the removal of outliers based on Dixon's test. We find that the between-replicate variability of the intensity estimates is lower for our method than for any of the others. We also address the issue of whether the background should be subtracted when estimating intensities. It has been argued that this should not be done because it increases variability, whereas the arguments for doing so are that there is a physical basis for the image background, and that not doing so will bias downward the estimated log ratios of differentially expressed genes. We show that the arguments on both sides of this debate are correct for our data, but that by using our model one can have the best of both worlds: One can subtract the background without increasing variability by much.
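The effect of a single outlying replicate, and the way a t-distributed error model downweights it, can be illustrated with a small sketch. This is not the Bayesian hierarchical model or the Markov chain Monte Carlo scheme described above: it is a much simpler iteratively reweighted location estimate with t-type weights, offered only to show the idea. The replicate values, the function name `t_weighted_location`, the fixed degrees of freedom `df=4`, and the scale fixed from the median absolute deviation are all assumptions made for illustration.

```python
import statistics


def t_weighted_location(values, df=4.0, n_iter=100, tol=1e-10):
    """Location estimate using weights implied by a t error model.

    Illustrative assumptions: df is fixed, and the scale is fixed at a
    MAD-based value instead of being estimated jointly with the location.
    """
    values = list(values)
    mu = statistics.median(values)  # robust starting point
    mad = statistics.median([abs(v - mu) for v in values])
    scale = max(1.4826 * mad, 1e-8)  # guard against a zero MAD
    weights = [1.0] * len(values)
    for _ in range(n_iter):
        # Weight for each replicate: small when its standardized residual is large.
        weights = [(df + 1.0) / (df + ((v - mu) / scale) ** 2) for v in values]
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return mu, weights


# Hypothetical log ratios for one gene: three consistent replicates and one outlier.
replicates = [1.02, 0.95, 1.10, 4.80]

mean_estimate = statistics.mean(replicates)
median_estimate = statistics.median(replicates)
t_estimate, t_weights = t_weighted_location(replicates)

print(f"mean:       {mean_estimate:.3f}")
print(f"median:     {median_estimate:.3f}")
print(f"t-weighted: {t_estimate:.3f}")
print("weights:   ", [round(w, 3) for w in t_weights])
```

In this toy case the plain mean is pulled strongly toward the outlying replicate, while the median and the t-weighted estimate stay near the three consistent values. The per-replicate weights also hint at how an outlier-aware model can double as a quality-control flag: a replicate with a very small weight is a candidate outlier.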
