Prediction of breast cancer prognosis using gene set statistics provides signature stability and biological context

Background

Microarray studies have compiled gene lists for predicting outcomes across a range of treatments and diseases. These lists show little overlap between studies, indicating that the results from any one study are unstable. It has been suggested that the underlying pathways are essentially identical, and that the expression of gene sets, rather than that of individual genes, may be more informative with respect to prognosis and understanding of the underlying biological process.

Results

We sought to examine the stability of prognostic signatures based on gene sets rather than individual genes. We classified breast cancer cases from five microarray studies according to the risk of metastasis, using features derived from predefined gene sets. The expression levels of the genes in each set are aggregated using what we call a set statistic. The resulting prognostic gene sets were as predictive as the lists of individual genes, but displayed more consistent rankings across bootstrap replications within datasets and produced more stable classifiers across different datasets. They are also potentially more interpretable biologically, because they examine each gene's expression in the context of its neighbouring genes in the pathway. In addition, we performed this analysis within each breast cancer molecular subtype, defined by ER/HER2 status. The prognostic gene sets found in each subtype were consistent with the biology established by previous analyses of individual genes.

Conclusions

To date, most analyses of gene expression data have focused on individual genes. We show that a complementary approach of examining the data using predefined gene sets can reduce noise and could provide increased insight into the underlying biological pathways.
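As a rough illustration of the set statistic idea described in the abstract, the sketch below collapses a per-gene expression profile into one feature per predefined gene set. The abstract does not say which aggregation functions were used, so the mean and median here are assumptions, and the function name `set_statistic`, the gene symbols, and the expression values are hypothetical toy inputs rather than data from the study.

```python
from statistics import mean, median


def set_statistic(expression, gene_sets, aggregate=mean):
    """Collapse one sample's per-gene expression into one value per gene set.

    expression: dict mapping gene symbol -> expression level
    gene_sets:  dict mapping set name -> list of gene symbols
    aggregate:  function reducing a list of numbers to one number
                (mean by default; an assumption, not the paper's stated choice)

    Genes missing from the profile are skipped, and sets with no measured
    genes are left out of the result.
    """
    features = {}
    for name, genes in gene_sets.items():
        values = [expression[g] for g in genes if g in expression]
        if values:
            features[name] = aggregate(values)
    return features


# Hypothetical toy profile for a single sample (values are made up).
sample = {"CCNB1": 2.0, "CDK1": 4.0, "MKI67": 6.0, "ESR1": -1.0, "GATA3": -3.0}

# Hypothetical gene sets; FOXA1 and TP53 are deliberately not measured above.
gene_sets = {
    "proliferation": ["CCNB1", "CDK1", "MKI67"],
    "er_signalling": ["ESR1", "GATA3", "FOXA1"],
    "unmeasured": ["TP53"],
}

features = set_statistic(sample, gene_sets)
median_features = set_statistic(sample, gene_sets, aggregate=median)
print(features)
```

The resulting per-set values would then stand in for individual gene expression levels as inputs to a classifier. Any function that reduces a list of numbers to a single number can be passed as `aggregate`.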
