Statistical Challenges in Preprocessing in Microarray Experiments in Cancer

Many clinical studies incorporate genomic experiments to investigate potential associations between high-dimensional molecular data and clinical outcome. A critical first step in the statistical analysis of these experiments is preprocessing the molecular data. This article provides an overview of preprocessing methods for microarrays, including summary algorithms and quality control metrics, and illustrates some of the effects that preprocessing methods have on the statistical results. The discussion is centered on a microarray experiment based on lung cancer tumor samples, with survival as the clinical outcome of interest. The procedures presented focus on the array platform used in this study; however, many of the issues are more general and apply to other instruments for genome-wide investigation. The discussion provides insight into the statistical challenges of preprocessing microarrays used in clinical studies of cancer. These challenges should not be viewed as inconsequential nuisances but rather as important issues that must be addressed so that informed conclusions can be drawn.
