Machine learning identifies interacting genetic variants contributing to breast cancer risk: A case study in Finnish cases and controls

We propose a machine learning approach to identify groups of interacting single nucleotide polymorphisms (SNPs) that contribute most to breast cancer (BC) risk, assuming dependencies among BCAC iCOGS SNPs. We adopt a gradient tree boosting method followed by an adaptive iterative SNP search to capture complex non-linear SNP-SNP interactions and thereby obtain groups of interacting SNPs with high BC risk-predictive potential. We also propose a support vector machine built on the identified SNPs to classify BC cases and controls. Our approach achieves a mean average precision (mAP) of 72.66, 67.24 and 69.25 in discriminating BC cases from controls in the KBCP, OBCS and merged KBCP-OBCS sample sets, respectively. These results exceed the mAP of 70.08, 63.61 and 66.41 obtained on the same three sample sets, respectively, by a polygenic risk score model derived from 51 known BC-associated SNPs. BC subtype analysis further reveals that the 200 KBCP SNPs identified by the proposed method perform favorably in classifying estrogen receptor positive (ER+) and negative (ER−) BC cases in both KBCP and OBCS data. Finally, a biological analysis of the identified SNPs reveals genes related to important BC-related mechanisms, namely estrogen metabolism and apoptosis.
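To make the reported metric concrete, the sketch below computes a polygenic risk score and the average precision of a score ranking on toy data. It is a minimal illustration, not the authors' pipeline: the abstract does not define how mAP was computed, so the code assumes the standard information-retrieval definition (mean of the precision at each rank where a case appears), and the function names, genotype coding (0, 1 or 2 risk alleles per SNP), weights and labels are all hypothetical.

```python
from typing import Sequence


def polygenic_risk_score(genotypes: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted sum of risk-allele counts (assumed coding: 0, 1 or 2 per SNP)."""
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must have the same length")
    return sum(g * w for g, w in zip(genotypes, weights))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@k over the ranks k at which a case (label 1) appears.

    Samples are ranked by descending score; ties keep their input order.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("average precision is undefined without any cases")
    return precision_sum / hits


# Hypothetical per-SNP weights (e.g. log odds ratios) and toy genotypes.
toy_weights = [0.2, 0.1, 0.3]
toy_genotypes = [
    [2, 1, 2],  # case
    [1, 0, 0],  # control
    [0, 2, 1],  # case
    [0, 1, 0],  # control
]
toy_labels = [1, 0, 1, 0]

toy_scores = [polygenic_risk_score(g, toy_weights) for g in toy_genotypes]
toy_ap = average_precision(toy_labels, toy_scores)
print(f"average precision on toy data: {toy_ap:.4f}")
```

Averaging this quantity over cross-validation folds or repeated splits would give one plausible reading of "mean" average precision; the abstract does not say which averaging the authors used.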
