Risk Prediction of Diabetes: Big data mining with fusion of multifarious physical examination indicators

Abstract Diabetes is a global epidemic, and long-term exposure to hyperglycemia can cause chronic damage to various tissues, so early diagnosis is crucial. In this study, we designed a computational system that predicts diabetes risk by fusing multiple types of physical examination data. We collected 1,507,563 physical examination records of healthy people and diabetes patients, as well as 387,076 physical examination records from the 2011 to 2017 follow-up records of diabetes patients in Luzhou City, China. Three types of physical examination indicators were statistically analyzed: demographics, vital signs, and laboratory values. To distinguish diabetes patients from healthy people, we developed a model based on eXtreme Gradient Boosting (XGBoost), which achieved an area under the receiver operating characteristic curve (AUC) of 0.8768. To make the model more convenient and flexible in clinical and real-life settings, we also established a diabetes risk scorecard based on logistic regression, which can be used to evaluate an individual's health status. Lastly, we statistically analyzed the follow-up records to identify the key factors influencing how well patients control their condition. To support diabetes cascade screening and personal lifestyle management, we established an online diabetes risk assessment system, freely accessible at http://lin-group.cn/server/DRSC/index.html. This system is expected to provide guidance for health management.
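The abstract does not say how the logistic regression model is turned into a scorecard. The sketch below shows one common convention borrowed from credit scoring, in which the model's log-odds are rescaled linearly so that each doubling of the odds adds a fixed number of points. It is an illustration under stated assumptions, not the authors' method: the indicator names, coefficients, intercept, and scaling constants are all hypothetical.

```python
import math

# Hypothetical logistic regression fit (NOT taken from the paper).
# Each feature is a 0/1 indicator for a binned physical examination index.
INTERCEPT = -3.0
COEFS = {
    "age_over_45": 0.8,
    "bmi_over_28": 0.6,
    "high_fasting_glucose": 1.5,
}

# Hypothetical scaling constants, following the credit-scorecard convention.
BASE_POINTS = 100.0    # score assigned at the reference odds
BASE_ODDS = 1.0 / 20   # reference odds of diabetes versus healthy
PDO = 20.0             # points added each time the odds double

FACTOR = PDO / math.log(2)
OFFSET = BASE_POINTS - FACTOR * math.log(BASE_ODDS)


def log_odds(features):
    """Linear predictor of the logistic regression model."""
    return INTERCEPT + sum(c * features.get(name, 0) for name, c in COEFS.items())


def probability(features):
    """Predicted probability of diabetes from the logistic model."""
    return 1.0 / (1.0 + math.exp(-log_odds(features)))


def risk_score(features):
    """Total scorecard points; a higher score means a higher predicted risk."""
    return OFFSET + FACTOR * log_odds(features)


def feature_points(features):
    """Split the total score into per-feature points.

    The constant part (offset plus scaled intercept) is spread evenly
    across the features so that the per-feature points sum to risk_score.
    """
    base = (OFFSET + FACTOR * INTERCEPT) / len(COEFS)
    return {
        name: base + FACTOR * c * features.get(name, 0)
        for name, c in COEFS.items()
    }


if __name__ == "__main__":
    person = {"age_over_45": 1, "bmi_over_28": 0, "high_fasting_glucose": 1}
    print("probability:", round(probability(person), 4))
    print("total score:", round(risk_score(person), 1))
    for name, pts in feature_points(person).items():
        print(f"  {name}: {pts:.1f}")
```

Because the score is a linear function of the log-odds, the per-indicator points can be tabulated once and added by hand, which is what makes a scorecard usable without running the model.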
