Incorporating External Risk Information with the Cox Model under Population Heterogeneity: Applications to Trans-Ancestry Polygenic Hazard Scores

Polygenic hazard score (PHS) models developed for individuals of European ancestry (EUR) carry substantial information for discriminating survival risk. Incorporating this information can improve risk discrimination in a small internal non-EUR cohort. However, because the external EUR-based model and the internal individual-level data come from different populations, ignoring population heterogeneity can introduce substantial bias. In this paper, we develop a Kullback-Leibler-based Cox model (CoxKL) that integrates internal individual-level time-to-event data with external risk scores derived from published prediction models while accounting for population heterogeneity. Partial-likelihood-based KL information is used to measure the discrepancy between the external risk information and the internal data. We establish the asymptotic properties of the CoxKL estimator. Simulation studies show that the integrated model fitted by the proposed CoxKL method achieves improved estimation efficiency and prediction accuracy. We applied the proposed method to develop a trans-ancestry PHS model for prostate cancer and found that integrating a previously published EUR-based PHS with internal genotype data from males of African ancestry (AFR) yielded considerable improvement in prostate cancer risk discrimination.
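The idea of a partial-likelihood-based KL discrepancy can be illustrated with a minimal sketch. The construction below is an assumption made for illustration and is not taken from the paper: at each observed event time, the Cox partial likelihood is read as a probability distribution over the subjects still at risk, and the KL divergence compares the distribution implied by the external risk scores with the one implied by the internal linear predictor. The function names, the weight `eta`, and the additive form of the objective are all hypothetical.

```python
import math

def log_risk_set_probs(scores, times, i):
    """Log-softmax of scores over subjects still at risk at times[i]."""
    at_risk = [j for j in range(len(times)) if times[j] >= times[i]]
    m = max(scores[j] for j in at_risk)
    lse = m + math.log(sum(math.exp(scores[j] - m) for j in at_risk))
    return {j: scores[j] - lse for j in at_risk}

def linear_predictor(beta, X):
    return [sum(b * x for b, x in zip(beta, row)) for row in X]

def neg_log_partial_lik(beta, X, times, events):
    """Negative Cox partial log-likelihood (no tied event times assumed)."""
    lp = linear_predictor(beta, X)
    return -sum(log_risk_set_probs(lp, times, i)[i]
                for i in range(len(times)) if events[i])

def kl_discrepancy(beta, X, times, events, ext_score):
    """Sum over event times of KL(external || internal) on each risk set.

    Illustrative assumption: this is one plausible discrepancy, not
    necessarily the one used by the CoxKL estimator.
    """
    lp = linear_predictor(beta, X)
    total = 0.0
    for i in range(len(times)):
        if not events[i]:
            continue
        log_q = log_risk_set_probs(ext_score, times, i)  # external scores
        log_p = log_risk_set_probs(lp, times, i)         # internal model
        total += sum(math.exp(log_q[j]) * (log_q[j] - log_p[j]) for j in log_q)
    return total

def integrated_objective(beta, X, times, events, ext_score, eta):
    """Hypothetical objective: internal fit plus eta times the KL term."""
    return (neg_log_partial_lik(beta, X, times, events)
            + eta * kl_discrepancy(beta, X, times, events, ext_score))

# Toy data: one covariate, four subjects, one censored observation.
X = [[0.5], [1.0], [-0.3], [0.2]]
times = [2.0, 1.0, 3.0, 4.0]
events = [1, 1, 0, 1]
beta = [0.7]
ext_score = [-1.0, 2.0, 0.0, 0.5]  # made-up external risk scores
```

In this sketch, `eta = 0` reduces to an ordinary Cox fit on the internal data, while larger values pull the internal risk ordering toward the external scores. How strongly to weight the external information would need to be tuned, for example by cross-validation.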
