Prediction models to identify individuals at risk of metabolic syndrome who are unlikely to participate in a health intervention program

OBJECTIVES Since the launch of a nationwide general health check-up and instruction program in Japan in 2008, interest has grown in using predictive analytics to improve the program's implementation. We investigated the performance of prediction models developed to identify individuals classified as "requiring instruction" (high-risk) who were unlikely to participate in a health intervention program.

METHODS Data were obtained from one large health insurance union in Japan. The study population included individuals who underwent at least one general health check-up between 2008 and 2013 and were identified as "requiring instruction" in 2013. We developed three prediction models using machine-learning techniques, based on the gradient boosted trees (GBT), random forest (RF), and logistic regression (LR) algorithms, and compared their areas under the curve (AUC) with those of two conventional methods. The aim of the models was to identify at-risk individuals who were unlikely to participate in the instruction program in 2013 after being classified as requiring instruction at their general health check-up that year.

RESULTS First, we performed the analysis using data without multiple imputation. The AUC values for the GBT, RF, and LR prediction models and for conventional methods 1 and 2 were 0.893 (95% CI: 0.882-0.905), 0.889 (95% CI: 0.877-0.901), 0.885 (95% CI: 0.872-0.897), 0.784 (95% CI: 0.767-0.800), and 0.757 (95% CI: 0.741-0.773), respectively. Subsequently, we performed the analysis using data after multiple imputation. The AUC values for the GBT, RF, and LR prediction models and for conventional methods 1 and 2 were 0.894 (95% CI: 0.882-0.906), 0.889 (95% CI: 0.887-0.901), 0.885 (95% CI: 0.872-0.898), 0.784 (95% CI: 0.767-0.800), and 0.757 (95% CI: 0.741-0.773), respectively. In both analyses, the GBT model showed the highest AUC of all the models, and the differences between the GBT model and the LR model, conventional method 1, and conventional method 2 were statistically significant.

CONCLUSION The prediction models using machine-learning techniques outperformed the existing conventional methods for predicting participation in the instruction program among individuals identified as "requiring instruction" (high-risk).
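The abstract compares AUCs that were all computed on the same individuals, so the estimates are correlated. It does not say which test was used. As an illustration only, the sketch below assumes a DeLong-style nonparametric comparison of two correlated AUCs with normal-approximation 95% confidence intervals. The function names (`placements`, `sample_cov`, `delong_compare`) are hypothetical, and the scores are synthetic rather than study data.

```python
import math
import random


def placements(scores, labels):
    """Return (auc, v10, v01): the AUC and DeLong's per-case placement values."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]

    def psi(x, y):
        # 1 if the positive case outranks the negative case, 0.5 on ties.
        return 1.0 if x > y else 0.5 if x == y else 0.0

    v10 = [sum(psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(psi(x, y) for x in pos) / len(pos) for y in neg]
    return sum(v10) / len(v10), v10, v01


def sample_cov(a, b, mean_a, mean_b):
    """Unbiased sample covariance of two equally long lists."""
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def delong_compare(scores_a, scores_b, labels):
    """Compare two correlated AUCs computed on the same individuals."""
    a = placements(scores_a, labels)
    b = placements(scores_b, labels)
    m, n = len(a[1]), len(a[2])  # numbers of positive and negative cases

    def auc_cov(p, q):
        # Covariance of two AUC estimates from their placement values.
        return (sample_cov(p[1], q[1], p[0], q[0]) / m
                + sample_cov(p[2], q[2], p[0], q[0]) / n)

    var_a, var_b, cov_ab = auc_cov(a, a), auc_cov(b, b), auc_cov(a, b)
    var_diff = var_a + var_b - 2.0 * cov_ab
    if var_diff <= 0.0:
        raise ValueError("difference has zero variance; the test is undefined")
    z = (a[0] - b[0]) / math.sqrt(var_diff)

    def ci(auc, var):
        # Normal-approximation 95% interval, clipped to [0, 1].
        half = 1.96 * math.sqrt(var)
        return max(0.0, auc - half), min(1.0, auc + half)

    return {
        "auc_a": a[0], "ci_a": ci(a[0], var_a),
        "auc_b": b[0], "ci_b": ci(b[0], var_b),
        "z": z,
        "p_value": math.erfc(abs(z) / math.sqrt(2.0)),  # two-sided
    }


# Synthetic example: label 1 stands for "did not participate" (an assumption).
rng = random.Random(0)
labels = [i % 2 for i in range(400)]
strong = [y + rng.gauss(0.0, 0.8) for y in labels]  # more informative scores
weak = [y + rng.gauss(0.0, 2.5) for y in labels]    # noisier scores
result = delong_compare(strong, weak, labels)
print(result)
```

The pairwise loop is quadratic in the number of cases, which is fine for a sketch; a rank-based formulation would be preferable for cohorts of realistic size.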
