Classification of Cancer Primary Sites Using Machine Learning and Somatic Mutations

Accurate classification of human cancer, including its primary site, is important for understanding cancer and for developing effective therapeutic strategies. The large volume of available somatic mutation data offers an opportunity to investigate cancer classification with machine learning. Here, we explored the patterns of 1,760,846 somatic mutations identified from 230,255 cancer patients, along with gene function information, using a support vector machine. Specifically, we performed a multiclass classification experiment over 17 tumor sites using the gene symbol, somatic mutation, chromosome, and gene functional pathway as predictors for 6,751 subjects. The baseline using only gene features achieved an accuracy of 0.57. Adding mutation and chromosome information improved accuracy to 0.62. Among the predictable primary tumor sites, five (large intestine, liver, skin, pancreas, and lung) were predicted with an F-measure above 0.70. The large intestine model ranked first, with an F-measure of 0.87. These results demonstrate that somatic mutation information is useful for predicting primary tumor sites with machine learning models. To our knowledge, this study is the first investigation of primary site classification using machine learning and somatic mutation data.
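The two metrics reported above, overall accuracy and per-site F-measure, can be illustrated with a minimal sketch. It assumes that "F-measure" means the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly. The function names and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of subjects whose predicted site equals the true site."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f_measure_per_class(y_true, y_pred):
    """Balanced F-measure (F1) for each site, treated one-vs-rest.

    F1 = 2*TP / (2*TP + FP + FN), which equals the harmonic mean of
    precision and recall. This variant is an assumption, not confirmed
    by the abstract.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted site p, but it was wrong
            fn[t] += 1  # true site t was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy labels, invented purely for illustration (not study data).
y_true = ["lung", "lung", "skin", "liver", "skin", "liver"]
y_pred = ["lung", "skin", "skin", "liver", "skin", "lung"]

print(accuracy(y_true, y_pred))
print(f_measure_per_class(y_true, y_pred))
```

Reporting the F-measure per site alongside overall accuracy shows which primary sites a single multiclass model predicts well, which an aggregate accuracy figure alone would hide.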
