Structure-Leveraged Methods in Breast Cancer Risk Prediction

Predicting breast cancer risk has long been a goal of medical research in the pursuit of precision medicine. This study develops novel penalized methods that improve breast cancer risk prediction by leveraging structural information in electronic health records (EHRs). We conducted a retrospective case-control study, collecting 49 mammography descriptors and 77 high-frequency/low-penetrance single-nucleotide polymorphisms (SNPs) from an existing personalized medicine data repository. Structured mammography reports and breast imaging features have long been part of the standard EHR, and genetic markers likely will be in the near future. Lasso and its variants are widely used for integrated learning and feature selection; our methodological contribution is to incorporate the dependence structure among the features into these approaches. Specifically, we propose a new method that combines a group penalty with an ℓp (1 ≤ p ≤ 2) fusion penalty, taking into account structural information in the mammography descriptors and SNPs. We demonstrate that our method provides benefits that are both statistically significant and potentially significant to people's lives.
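To make the combination of penalties concrete, the following is a minimal sketch of a logistic-regression objective with a group penalty plus a fusion penalty raised to a power p in [1, 2]. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the function names, the square-root group-size weights, the representation of structure as index groups and linked feature pairs (`groups`, `edges`), and the tuning parameters `lam_group` and `lam_fuse` are all hypothetical and not taken from the study.

```python
import numpy as np


def logistic_loss(beta, X, y):
    """Mean negative log-likelihood for binary labels y in {0, 1}."""
    z = X @ beta
    # log(1 + exp(z)) evaluated stably with logaddexp
    return np.mean(np.logaddexp(0.0, z) - y * z)


def group_penalty(beta, groups):
    """Group-lasso term: sqrt(group size) times the Euclidean norm of each group."""
    return sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups)


def fusion_penalty(beta, edges, p):
    """Fusion term: sum of |beta_i - beta_j|**p over linked feature pairs."""
    if not 1.0 <= p <= 2.0:
        raise ValueError("p must lie in [1, 2]")
    return sum(abs(beta[i] - beta[j]) ** p for i, j in edges)


def objective(beta, X, y, groups, edges, lam_group, lam_fuse, p):
    """Penalized objective: loss + group penalty + fusion penalty (assumed form)."""
    return (logistic_loss(beta, X, y)
            + lam_group * group_penalty(beta, groups)
            + lam_fuse * fusion_penalty(beta, edges, p))


# Toy usage with made-up data: 4 features in 2 groups, neighbors linked within groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = rng.integers(0, 2, size=20)
groups = [[0, 1], [2, 3]]
edges = [(0, 1), (2, 3)]
beta = np.array([0.5, 0.4, 0.0, 0.0])
value = objective(beta, X, y, groups, edges, lam_group=0.1, lam_fuse=0.1, p=1.5)
```

In this sketch the group term can zero out an entire block of related features (for example, descriptors of one mammographic finding or SNPs in one region), while the fusion term pulls the coefficients of linked features toward each other. With p = 1 the fusion term behaves like the fused lasso and with p = 2 it becomes a smooth quadratic penalty.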
