Data Perturbation Independent Diagnosis and Validation of Breast Cancer Subtypes Using Clustering and Patterns

Molecular stratification of disease based on the expression levels of sets of genes can help guide therapeutic decisions, provided that the classifications are stable against variations in sample source and against data perturbation. Classifications inferred from one set of samples in one lab should consistently stratify a different set of samples in another lab. We present a method for assessing such stability and apply it to the breast cancer (BCA) datasets of Sorlie et al. 2003 and Ma et al. 2003. Among the now commonly accepted BCA categories identified by Sorlie et al., we find that Luminal A and Basal are robust, but Luminal B and ERBB2+ are not. In particular, 36% of the samples identified as Luminal B and 55% of those identified as ERBB2+ cannot be accurately assigned to a category because their classification is sensitive to data perturbation. We identify a “core cluster” of samples for each category, and from these core clusters we determine “patterns” of gene expression that distinguish them from each other. We find that the best markers for Luminal A and Basal are (ESR1, LIV1, GATA-3) and (CCNE1, LAD1, KRT5), respectively. Pathways enriched in the patterns regulate apoptosis, tissue remodeling and the immune response. We then use a different dataset (Ma et al. 2003) to test the accuracy with which samples can be allocated to the four disease subtypes. As expected, the classification of samples identified as Luminal A and Basal is robust, but classification into the other two subtypes is not.
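The idea of a perturbation-stable "core cluster" can be illustrated with a minimal sketch. It is not the procedure used in the paper, whose details the abstract does not give. The sketch assumes that perturbation means adding independent Gaussian noise to every expression value, that reclustering is approximated by reassigning each sample to the nearest class centroid, and that a sample belongs to the core of its class if it keeps its reference label in at least 90% of perturbed runs. The function names, the noise level, the threshold and the toy data are all hypothetical.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest center (squared Euclidean distance) for each row of X."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def perturbation_stability(X, ref_labels, n_runs=200, noise_sd=1.0, seed=0):
    """Fraction of perturbed runs in which each sample keeps its reference label."""
    rng = np.random.default_rng(seed)
    classes = np.unique(ref_labels)
    kept = np.zeros(len(X))
    for _ in range(n_runs):
        # Assumed perturbation model: independent Gaussian noise on every value.
        Xp = X + rng.normal(0.0, noise_sd, size=X.shape)
        # Recompute class centroids on the perturbed data, then reassign samples.
        centers = np.stack([Xp[ref_labels == c].mean(axis=0) for c in classes])
        new_labels = classes[nearest_center(Xp, centers)]
        kept += new_labels == ref_labels
    return kept / n_runs

def core_cluster(stability, ref_labels, label, threshold=0.9):
    """Indices of samples with the given label whose stability meets the threshold."""
    return np.flatnonzero((ref_labels == label) & (stability >= threshold))

# Toy data (hypothetical): 20 samples per well-separated class, plus two
# borderline samples nominally labelled as class 1.
X = np.vstack([np.zeros((20, 5)), np.full((20, 5), 4.0), np.full((2, 5), 2.0)])
ref_labels = np.array([0] * 20 + [1] * 22)

stability = perturbation_stability(X, ref_labels)
core_1 = core_cluster(stability, ref_labels, label=1)
print("core of class 1:", core_1)
print("stability of borderline samples:", stability[40:])
```

In this toy setting the well-separated samples keep their label in essentially every run, while the two borderline samples change label often and so fall outside the core. This mirrors, in a very simplified way, the distinction the abstract draws between robust and perturbation-sensitive assignments.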

[1]  Peter L. Hammer,et al.  Spanned patterns for the logical analysis of data , 2006, Discret. Appl. Math..

[2]  A. Ullrich,et al.  Involvement of the FGFR4 Arg388 allele in head and neck squamous cell carcinoma , 2004, International journal of cancer.

[3]  Yoshua Bengio,et al.  Pattern Recognition and Neural Networks , 1995 .

[4]  Pedro Larrañaga,et al.  Filter versus wrapper gene selection approaches in DNA microarray domains , 2004, Artif. Intell. Medicine.

[5]  A. Agarwal,et al.  PAR1 Is a Matrix Metalloprotease-1 Receptor that Promotes Invasion and Tumorigenesis of Breast Cancer Cells , 2005, Cell.

[6]  E. Lehmann,et al.  Nonparametrics: Statistical Methods Based on Ranks , 1976 .

[7]  Christian A. Rees,et al.  Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[8]  Carsten O. Peterson,et al.  Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. , 2001, Cancer research.

[9]  Ying Liu,et al.  A Comparative Study on Feature Selection Methods for Drug Discovery , 2004, J. Chem. Inf. Model..

[10]  Michael J. Pazzani,et al.  Classification and regression by combining models , 1998 .

[11]  P. Rousseeuw Silhouettes: a graphical aid to the interpretation and validation of cluster analysis , 1987 .

[12]  C. Bonferroni Il calcolo delle assicurazioni su gruppi di teste , 1935 .

[13]  Christine Rivat,et al.  Implication of STAT3 Signaling in Human Colonic Cancer Cells during Intestinal Trefoil Factor 3 (TFF3)- and Vascular Endothelial Growth Factor-Mediated Cellular Invasion and Tumor Growth , 2005, Cancer Research.

[14]  David W. Mount,et al.  Pathway Miner: extracting gene association networks from molecular pathways for predicting the biological significance of gene expression microarray data , 2004, Bioinform..

[15]  P. Hammer,et al.  Breast cancer prognosis by combinatorial analysis of gene expression data , 2006, Breast Cancer Research.

[16]  J. Paulson,et al.  Clinical significance of ST3Gal IV expression in human renal cell carcinoma. , 2002, Oncology reports.

[17]  T. Tsukamoto,et al.  Introduction of the c-kit gene leads to growth suppression of a breast cancer cell line, MCF-7. , 1996, Anticancer research.

[18]  E. Bowman,et al.  Identification of carboxypeptidase E and gamma-glutamyl hydrolase as biomarkers for pulmonary neuroendocrine tumors by cDNA microarray. , 2004, Human pathology.

[19]  L. Skoog,et al.  Role of cyclin D1 in ErbB2-positive breast cancer and tamoxifen resistance , 2005, Breast Cancer Research and Treatment.

[20]  Brad T. Sherman,et al.  DAVID: Database for Annotation, Visualization, and Integrated Discovery , 2003, Genome Biology.

[21]  T. Oda,et al.  Cytoplasmic expression of laminin γ2 chain correlates with postoperative hepatic metastasis and poor prognosis in patients with pancreatic ductal adenocarcinoma , 2002, Cancer.

[22]  Christian A. Rees,et al.  Molecular portraits of human breast tumours , 2000, Nature.

[23]  Student,et al.  The Probable Error of a Mean , 1908 .

[24]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..

[25]  I. Bièche,et al.  Genetic alterations in breast cancer , 1995, Genes, chromosomes & cancer.

[26]  W. Heiser,et al.  Instability of hierarchical cluster analysis due to input order of the data: the PermuCLUSTER solution. , 2005, Psychological methods.

[27]  P. Fumoleau,et al.  G388R mutation of the FGFR4 gene is not relevant to breast cancer prognosis , 2004, British Journal of Cancer.

[28]  T. Sørlie,et al.  Distinct molecular mechanisms underlying clinically relevant subtypes of breast cancer: gene expression analyses across three different platforms , 2006, BMC Genomics.

[29]  Rainer Breitling,et al.  Rank-based Methods as a Non-parametric Alternative of the T-statistic for the Analysis of Biological Microarray Data , 2005, J. Bioinform. Comput. Biol..

[30]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[31]  Daniel Birnbaum,et al.  Gene expression profiling identifies molecular subtypes of inflammatory breast cancer. , 2005, Cancer research.

[32]  Andreas Rytz,et al.  The limit fold change model: A practical approach for selecting differentially expressed genes from microarray data , 2002, BMC Bioinformatics.

[33]  Yudong D. He,et al.  Gene expression profiling predicts clinical outcome of breast cancer , 2002, Nature.

[34]  Peter L. Hammer,et al.  Consensus algorithms for the generation of all maximal bicliques , 2004, Discret. Appl. Math..

[35]  James Lyons-Weiler,et al.  caGEDA: a web application for the integrated analysis of global gene expression patterns in cancer , 2004, Applied bioinformatics.

[36]  J. Royds,et al.  Nuclear localization of Y-box factor YB1 requires wild-type p53 , 2003, Oncogene.

[37]  Jill P. Mesirov,et al.  Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data , 2003, Machine Learning.

[38]  R. Tibshirani,et al.  Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[39]  Vessela N Kristensen,et al.  Gene expression profiling of breast cancer in relation to estrogen receptor status and estrogen-metabolizing enzymes: clinical implications. , 2005, Clinical cancer research : an official journal of the American Association for Cancer Research.

[40]  E. Yorida,et al.  Akt phosphorylates the Y-box binding protein 1 at Ser102 located in the cold shock domain and affects the anchorage-independent growth of breast cancer cells , 2005, Oncogene.

[41]  Russ B. Altman,et al.  Missing value estimation methods for DNA microarrays , 2001, Bioinform..

[42]  R. Tibshirani,et al.  Significance analysis of microarrays applied to the ionizing radiation response , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[43]  Ilya Shmulevich,et al.  In silico microdissection of microarray data from heterogeneous cell populations , 2005, BMC Bioinformatics.

[44]  Todd,et al.  Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning , 2002, Nature Medicine.

[45]  C. Gajdos,et al.  Reversal of tamoxifen resistant breast cancer by low dose estrogen therapy , 2005, The Journal of Steroid Biochemistry and Molecular Biology.

[46]  David Cameron,et al.  Identification of molecular apocrine breast tumours by microarray analysis , 2005, Oncogene.

[47]  R. Spang,et al.  Predicting the clinical status of human breast cancer by using gene expression profiles , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[48]  Xiaoxing Liu,et al.  An Entropy-based gene selection method for cancer classification using microarray data , 2005, BMC Bioinformatics.

[49]  M K Markey,et al.  Application of the mutual information criterion for feature selection in computer-aided diagnosis. , 2001, Medical physics.

[50]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[51]  Richard M. Simon,et al.  Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data , 2002, Bioinform..

[52]  A. Nobel,et al.  The molecular portraits of breast tumors are conserved across microarray platforms , 2006, BMC Genomics.

[53]  S. Dudoit,et al.  Multiple Hypothesis Testing in Microarray Experiments , 2003 .

[54]  Marcel J. T. Reinders,et al.  A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets , 2006, BMC Bioinformatics.

[55]  A. Hönig,et al.  Preoperative chemotherapy and endocrine therapy in patients with breast cancer. , 2004, Clinical breast cancer.

[56]  M. Lacroix,et al.  The “portrait” of hereditary breast cancer , 2005, Breast Cancer Research and Treatment.

[57]  Liang Goh,et al.  An Integrated Feature Selection and Classification Method to Select Minimum Number of Variables on the Case Study of Gene Expression Data , 2005, J. Bioinform. Comput. Biol..

[59]  Y. Soini,et al.  Distribution of basement membrane anchoring molecules in normal and transformed endometrium: Altered expression of laminin γ2 chain and collagen type XVII in endometrial adenocarcinomas , 2004, Journal of Molecular Histology.

[60]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[61]  E. Gelfand,et al.  Cyclin-dependent kinase 6 inhibits proliferation of human mammary epithelial cells. , 2004, Molecular cancer research : MCR.

[62]  Philip M. Long,et al.  Breast cancer classification and prognosis based on gene expression profiles from a population-based study , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[63]  R. Tibshirani,et al.  Repeated observation of breast tumor subtypes in independent gene expression data sets , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[64]  R. Espinosa,et al.  Amplification and overexpression of peroxisome proliferator-activated receptor binding protein (PBP/PPARBP) gene in breast cancer. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[65]  G Alexe,et al.  Logical analysis of diffuse large B-cell lymphomas , 2005, Artif. Intell. Medicine.

[66]  Cesare Furlanello,et al.  Entropy-based gene ranking without selection bias for the predictive classification of microarray data , 2003, BMC Bioinformatics.

[67]  János Szöllosi,et al.  Epidermal growth factor receptor coexpression modulates susceptibility to Herceptin in HER2/neu overexpressing breast cancer cells via specific erbB-receptor interaction and activation. , 2005, Experimental cell research.

[68]  R. Weinberg,et al.  The Biology of Cancer , 2006 .

[69]  T. Poggio,et al.  Multiclass cancer diagnosis using tumor gene expression signatures , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[70]  D. Hanahan,et al.  The Hallmarks of Cancer , 2000, Cell.

[71]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[72]  Peter L. Hammer,et al.  Pattern-based feature selection in genomics and proteomics , 2006, Ann. Oper. Res..

[73]  Rohini Sharma,et al.  Systematic review of LHRH agonists for the adjuvant treatment of early breast cancer. , 2005, Breast.

[74]  Ian B. Jeffery,et al.  Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data , 2006, BMC Bioinformatics.

[75]  Gabriela Alexe,et al.  Robust diagnosis of non-Hodgkin lymphoma phenotypes validated on gene expression data from different laboratories. , 2005, Genome informatics. International Conference on Genome Informatics.

[76]  Christine Desmedt,et al.  Breast cancer gene expression profiling: clinical trial and practice implications. , 2005, Pharmacogenomics.

[77]  R. Salunga,et al.  Gene expression profiles of human breast cancer progression , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[78]  Vladimir Pavlovic,et al.  RankGene: identification of diagnostic genes based on expression data , 2003, Bioinform..

[79]  Eytan Domany,et al.  Outcome signature genes in breast cancer: is there a unique set? , 2004, Breast Cancer Research.

[80]  G. Ball,et al.  High‐throughput protein expression analysis using tissue microarray technology of a large well‐characterised series identifies biologically distinct classes of breast cancer confirming recent cDNA expression analyses , 2005, International journal of cancer.

[82]  C. Perou,et al.  Cell-Type-Specific Responses to Chemotherapeutics in Breast Cancer , 2004, Cancer Research.

[83]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[84]  Peter L. Hammer,et al.  Comprehensive vs. comprehensible classifiers in logical analysis of data , 2008, Discret. Appl. Math..

[85]  J. Anim,et al.  Relationship between the expression of various markers and prognostic factors in breast cancer. , 2005, Acta histochemica.

[86]  F. Bertucci,et al.  Gene expression profiling of breast cell lines identifies potential new basal markers , 2006, Oncogene.

[87]  T. Golub,et al.  Molecular profiling of diffuse large B-cell lymphoma identifies robust subtypes including one characterized by host inflammatory response. , 2004, Blood.

[88]  John D. Storey,et al.  Statistical significance for genomewide studies , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[89]  A. Valencia,et al.  A gene network for navigating the literature , 2004, Nature Genetics.

[90]  V. Arango,et al.  Using the Gene Ontology for Microarray Data Mining: A Comparison of Methods and Application to Age Effects in Human Prefrontal Cortex , 2004, Neurochemical Research.

[91]  Y. Crama,et al.  Cause-effect relationships and partially defined Boolean functions , 1988 .

[92]  J. Mesirov,et al.  GenePattern 2.0 , 2006, Nature Genetics.

[94]  J. V. Bradley Distribution-Free Statistical Tests , 1968 .