A comprehensive sensitivity analysis of microarray breast cancer classification under feature variability

BackgroundLarge discrepancies in signature composition and outcome concordance have been observed between different microarray breast cancer expression profiling studies. This is often ascribed to differences in array platform as well as biological variability. We conjecture that other reasons for the observed discrepancies are the measurement error associated with each feature and the choice of preprocessing method. Microarray data are known to be subject to technical variation and the confidence intervals around individual point estimates of expression levels can be wide. Furthermore, the estimated expression values also vary depending on the selected preprocessing scheme. In microarray breast cancer classification studies, however, these two forms of feature variability are almost always ignored and hence their exact role is unclear.ResultsWe have performed a comprehensive sensitivity analysis of microarray breast cancer classification under the two types of feature variability mentioned above. We used data from six state of the art preprocessing methods, using a compendium consisting of eight diferent datasets, involving 1131 hybridizations, containing data from both one and two-color array technology. For a wide range of classifiers, we performed a joint study on performance, concordance and stability. In the stability analysis we explicitly tested classifiers for their noise tolerance by using perturbed expression profiles that are based on uncertainty information directly related to the preprocessing methods. Our results indicate that signature composition is strongly influenced by feature variability, even if the array platform and the stratification of patient samples are identical. In addition, we show that there is often a high level of discordance between individual class assignments for signatures constructed on data coming from different preprocessing schemes, even if the actual signature composition is identical.ConclusionFeature variability can have a strong impact on breast cancer signature composition, as well as the classification of individual patient samples. We therefore strongly recommend that feature variability is considered in analyzing data from microarray breast cancer expression profiling experiments.

[1]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[2]  P. Stafford,et al.  Three methods for optimization of cross-laboratory and cross-platform microarray expression data , 2007, Nucleic acids research.

[3]  S. Dudoit,et al.  Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data , 2002 .

[4]  P. Hall,et al.  An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[5]  Y. Tu,et al.  Quantitative noise analysis for gene expression microarray experiments , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[6]  C. Li,et al.  Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[7]  L. Holmberg,et al.  Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts , 2005, Breast Cancer Research.

[8]  Daniel Q. Naiman,et al.  Simple decision rules for classifying human cancers from gene expression profiles , 2005, Bioinform..

[9]  Neil D. Lawrence,et al.  Probe-level measurement error improves accuracy in detecting differential gene expression , 2006, Bioinform..

[10]  Seon-Young Kim,et al.  Effects of sample size on robustness and prediction accuracy of a prognostic gene signature , 2009, BMC Bioinformatics.

[11]  Benjamin M. Bolstad,et al.  affy - analysis of Affymetrix GeneChip data at the probe level , 2004, Bioinform..

[12]  James J. Chen,et al.  Reproducibility of microarray data: a further analysis of microarray quality control (MAQC) data , 2007, BMC Bioinformatics.

[13]  Peng Liang MAQC papers over the cracks , 2007, Nature Biotechnology.

[14]  David G. Stork,et al.  Pattern Classification , 1973 .

[15]  Ross Ihaka,et al.  Gentleman R: R: A language for data analysis and graphics , 1996 .

[16]  Rafael A. Irizarry,et al.  Comparison of Affymetrix GeneChip expression measures , 2006, Bioinform..

[17]  Gordon K. Smyth,et al.  A comparison of background correction methods for two-colour microarrays , 2007, Bioinform..

[18]  Van,et al.  A gene-expression signature as a predictor of survival in breast cancer. , 2002, The New England journal of medicine.

[19]  A. Yakovlev,et al.  How high is the level of technical noise in microarray data? , 2007, Biology Direct.

[20]  Neil D. Lawrence,et al.  puma: a Bioconductor package for propagating uncertainty in microarray analysis , 2009, BMC Bioinformatics.

[21]  Ajay N. Jain,et al.  Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. , 2006, Cancer cell.

[22]  Dennis B. Troup,et al.  NCBI GEO: archive for high-throughput functional genomic data , 2008, Nucleic Acids Res..

[23]  Stefan Michiels,et al.  Prediction of cancer outcome with microarrays: a multiple random validation strategy , 2005, The Lancet.

[24]  J. Bergh,et al.  Definition of clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas through genomic grade. , 2007, Journal of clinical oncology : official journal of the American Society of Clinical Oncology.

[25]  Neil D. Lawrence,et al.  A tractable probabilistic model for Affymetrix probe-level analysis across multiple chips , 2005, Bioinform..

[26]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[27]  C. P. Hong,et al.  Sequenced BAC anchored reference genetic map that reconciles the ten individual chromosomes of Brassica rapa , 2009, BMC Genomics.

[28]  Marcel J. T. Reinders,et al.  The effect of oligonucleotide microarray data pre-processing on the analysis of patient-cohort studies , 2006, BMC Bioinformatics.

[29]  E. Lander,et al.  Gene expression correlates of clinical prostate cancer behavior. , 2002, Cancer cell.

[30]  Daniel Q. Naiman,et al.  Classifying Gene Expression Profiles from Pairwise mRNA Comparisons , 2004, Statistical applications in genetics and molecular biology.

[31]  Jean YH Yang,et al.  Bioconductor: open software development for computational biology and bioinformatics , 2004, Genome Biology.

[32]  Fabien Reyal,et al.  Pooling breast cancer datasets has a synergetic effect on classification performance and improves signature stability , 2008, BMC Genomics.

[33]  M Milo,et al.  A probabilistic model for the extraction of expression levels from oligonucleotide arrays. , 2003, Biochemical Society transactions.

[34]  Carlos Caldas,et al.  A comprehensive analysis of prognostic signatures reveals the high predictive capacity of the Proliferation, Immune response and RNA splicing modules in breast cancer , 2008, Breast Cancer Research.

[35]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[36]  Todd,et al.  Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning , 2002, Nature Medicine.

[37]  M. Dugas,et al.  Profound effect of normalization on detection of differentially expressed genes in oligonucleotide microarray data analysis , 2002, Genome Biology.

[38]  R. Irizarry,et al.  Consolidated strategy for the analysis of microarray spike-in data , 2008, Nucleic acids research.

[39]  Rafael A Irizarry,et al.  Exploration, normalization, and summaries of high density oligonucleotide array probe level data. , 2003, Biostatistics.

[40]  Hongyue Dai,et al.  Rosetta error model for gene expression analysis , 2006, Bioinform..

[41]  Hanlee P. Ji,et al.  The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. , 2006, Nature biotechnology.

[42]  Neil D. Lawrence,et al.  Accounting for probe-level noise in principal component analysis of microarray data , 2005, Bioinform..

[43]  Dhammika Amaratunga,et al.  Exploration and Analysis of DNA Microarray and Protein Array Data , 2003, Wiley series in probability and statistics.

[44]  Constantin F. Aliferis,et al.  A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification , 2008, BMC Bioinformatics.

[45]  Ben Bolstad,et al.  Low-level Analysis of High-density Oligonucleotide Array Data: Background, Normalization and Summarization , 2003 .

[46]  Wolfgang Schmidt,et al.  Early iron-deficiency-induced transcriptional changes in Arabidopsis roots as revealed by microarray analyses , 2009, BMC Genomics.

[47]  J. Foekens,et al.  Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer , 2005, The Lancet.

[48]  Sergio Contrino,et al.  ArrayExpress—a public repository for microarray gene expression data at the EBI , 2004, Nucleic Acids Res..

[49]  Marcel J. T. Reinders,et al.  A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets , 2006, BMC Bioinformatics.

[50]  Rudolph S. Parrish,et al.  BMC Bioinformatics BioMed Central Research article Sources of variation in Affymetrix microarray experiments , 2005 .

[51]  Andy J. Minn,et al.  Genes that mediate breast cancer metastasis to lung , 2005, Nature.

[52]  Yudong D. He,et al.  Gene expression profiling predicts clinical outcome of breast cancer , 2002, Nature.

[53]  Yudong D. He,et al.  A Gene-Expression Signature as a Predictor of Survival in Breast Cancer , 2002 .

[54]  Cor J. Veenman,et al.  A protocol for building and evaluating predictors of disease state based on microarray data , 2005, Bioinform..

[55]  Cheng Li,et al.  Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application , 2001, Genome Biology.

[56]  Patrick Kemmeren,et al.  Multiple robust signatures for detecting lymph node metastasis in head and neck cancer. , 2006, Cancer research.

[57]  Maqc Consortium The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements , 2006, Nature Biotechnology.

[58]  Eytan Domany,et al.  Outcome signature genes in breast cancer: is there a unique set? , 2004, Breast Cancer Research.

[59]  J. Bergh,et al.  Strong Time Dependence of the 76-Gene Prognostic Signature for Node-Negative Breast Cancer Patients in the TRANSBIG Multicenter Independent Validation Series , 2007, Clinical Cancer Research.

[60]  Terence P. Speed,et al.  A benchmark for Affymetrix GeneChip expression measures , 2004, Bioinform..

[61]  David P. Kreil,et al.  There is no silver bullet - a guide to low-level data transforms and normalisation methods for microarray data , 2005, Briefings Bioinform..

[62]  Neil D. Lawrence,et al.  Propagating uncertainty in microarray data analysis , 2006, Briefings Bioinform..

[63]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.