Classification of breast cancer patients using somatic mutation profiles and machine learning approaches

Background

The high degree of heterogeneity observed in breast cancers makes it very difficult to classify patients into distinct clinical subgroups and consequently limits the ability to devise effective therapeutic strategies. Several classification strategies based on ER/PR/HER2 expression or on the expression profiles of a panel of genes have helped, but such methods often produce misleading results because expression levels are dynamic. In contrast, somatic DNA mutations are relatively stable and lead to the initiation and progression of many sporadic cancers. Hence, in this study, we explore the use of gene mutation profiles to classify, characterize and predict the subgroups of breast cancers.

Results

We analyzed the whole exome sequencing data from 358 ethnically similar breast cancer patients in The Cancer Genome Atlas (TCGA) project. Somatic, non-synonymous single nucleotide variants identified in each patient were assigned a quantitative score (C-score) that represents the extent of their negative impact on gene function. Using these scores with the non-negative matrix factorization method, we clustered the patients into three subgroups. By comparing the clinical stage of the patients, we identified an early-stage-enriched and a late-stage-enriched subgroup. Comparison of the mutation scores of the early-stage-enriched and late-stage-enriched subgroups identified 358 genes that carry significantly higher mutation rates in the late-stage-enriched subgroup. Functional characterization of these genes revealed important functional gene families that carry a heavy mutational load in the late-stage-enriched subgroup of patients. Finally, using the identified subgroups, we also developed a supervised classification model to predict the stage of the patients.

Conclusions

This study demonstrates that gene mutation profiles can be effectively used with unsupervised machine-learning methods to identify clinically distinguishable breast cancer subgroups. The classification model developed in this study could provide a reasonable prediction of a cancer patient's stage based solely on the mutation profile. This study represents the first use of only somatic mutation profile data to identify and predict breast cancer subgroups, and this generic methodology can also be applied to other cancer datasets.
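The clustering step summarized in the Results can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a patients-by-genes matrix of non-negative mutation scores, uses the standard multiplicative-update rules for non-negative matrix factorization, and assigns each patient to the factor with the largest weight. The toy `scores` matrix, the function names, the iteration count and the assignment rule are all hypothetical choices made for illustration.

```python
import random


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of v - w @ h."""
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))


def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize v (patients x genes) as w @ h with multiplicative updates.

    w is patients x k and h is k x genes; both stay non-negative.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


def assign_subgroups(w):
    """Assign each patient (row of w) to the factor with the largest weight."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in w]


# Hypothetical toy data: 6 patients x 6 genes, entries are per-gene mutation
# scores (0.0 means no scored mutation in that gene for that patient).
scores = [
    [25.0, 18.0, 0.0, 0.0, 0.0, 0.0],
    [22.0, 20.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 30.0, 15.0, 0.0, 0.0],
    [0.0, 0.0, 27.0, 17.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 21.0, 24.0],
    [0.0, 0.0, 0.0, 0.0, 19.0, 26.0],
]

w, h = nmf(scores, k=3)
labels = assign_subgroups(w)
print(labels)
```

On a matrix with this block structure, patients who share a mutated gene block would be expected to receive the same label, but the result of NMF depends on the random initialization, so real analyses typically repeat the factorization over many restarts and check the stability of the resulting clusters.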
