Prediction of inherited genomic susceptibility to 20 common cancer types by a supervised machine-learning method

Significance: Prevention and early diagnosis of cancer are the most effective ways of avoiding psychological, physical, and financial suffering from cancer. We present a machine-learning method for statistically predicting individuals’ inherited susceptibility (and environmental/lifestyle factors by inference) for acquiring the most likely type among a panel of 20 major common cancer types plus 1 “healthy” type. The results show that, depending on the type, about 33–88% of a cancer cohort acquired their cancer type primarily because of inherited genomic susceptibility factors, and that the rest acquired it primarily because of environmental/lifestyle factors. These personal genomic susceptibilities with associated probabilities may provide practical information for individuals, health professionals, and health policymakers related to prevention and/or early intervention of cancer.

Prevention and early intervention are the most effective ways of avoiding or minimizing psychological, physical, and financial suffering from cancer. However, such proactive action requires the ability to predict the individual’s susceptibility to cancer with a measure of probability. Of the triad of cancer-causing factors (inherited genomic susceptibility, environmental factors, and lifestyle factors), the inherited genomic component may be derivable from the recent public availability of a large body of whole-genome variation data. However, genome-wide association studies have so far shown limited success in predicting the inherited susceptibility to common cancers. We present here a multiple classification approach for predicting individuals’ inherited genomic susceptibility to acquire the most likely phenotype among a panel of 20 major common cancer types plus 1 “healthy” type by application of a supervised machine-learning method under competing conditions among the cohorts of the 21 types. This approach suggests that, depending on the phenotypes of the 5,919 individuals of the “white” ethnic population in this study, (i) the portion of the cohort of a cancer type who acquired the observed type due mostly to inherited genomic susceptibility factors ranges from about 33 to 88% (or its corollary: the portion due mostly to environmental and lifestyle factors ranges from 12 to 67%), and (ii) on an individual level, the method also predicts individuals’ inherited genomic susceptibility to acquire the other types, ranked with associated probabilities. These probabilities may provide practical information for individuals, health professionals, and health policymakers related to prevention and/or early intervention of cancer.
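To make the idea of ranking competing types with associated probabilities concrete, the following is a minimal sketch only. The abstract does not name the classifier or the features, so the choice of a k-nearest-neighbor vote, the 0/1/2 genotype coding, the distance measure, the three toy labels, and all function and variable names here are assumptions for illustration, not the study's method or data.

```python
from collections import Counter

def rank_types(train, query, k=5):
    """Rank every label by the fraction of the k nearest training
    individuals that carry it (a simple k-nearest-neighbor vote).

    train: list of (genotype_vector, label) pairs; genotypes are coded
           0/1/2 (an assumed encoding, not taken from the study).
    query: genotype vector of the individual to score.
    """
    # Distance = sum of absolute differences between genotype codes.
    nearest = sorted(
        (sum(abs(a - b) for a, b in zip(vec, query)), label)
        for vec, label in train
    )[:k]
    votes = Counter(label for _, label in nearest)
    labels = sorted({label for _, label in train})
    # Stable sort: labels with more votes come first.
    ranking = sorted(labels, key=lambda label: -votes[label])
    return [(label, votes[label] / k) for label in ranking]

# Toy cohort with three hypothetical types (the study uses 21).
train = [
    ((0, 0, 0, 0, 0, 0), "healthy"),
    ((0, 0, 1, 0, 0, 0), "healthy"),
    ((0, 1, 0, 0, 0, 0), "healthy"),
    ((2, 2, 0, 0, 0, 0), "type_A"),
    ((2, 2, 1, 0, 0, 0), "type_A"),
    ((2, 1, 0, 0, 0, 0), "type_A"),
    ((0, 0, 0, 0, 2, 2), "type_B"),
    ((0, 0, 0, 1, 2, 2), "type_B"),
    ((0, 0, 0, 0, 1, 2), "type_B"),
]
query = (2, 2, 0, 0, 0, 1)

for label, prob in rank_types(train, query, k=5):
    print(f"{label}: {prob:.2f}")
```

In this toy setting, an individual whose top-ranked type matches the observed type would be counted toward the portion attributed mostly to inherited factors; that reading is a simplification of the cohort-level estimate described above.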
