Enhancing the Human Health Status Prediction: The ATHLOS Project

ABSTRACT Preventive healthcare is a crucial pillar of health, as it contributes to staying healthy and receiving immediate treatment when needed. Mining knowledge from longitudinal studies has the potential to contribute significantly to the improvement of preventive healthcare. Unfortunately, data originating from such studies are characterized by high complexity, huge volume, and a plethora of missing values. Machine Learning, Data Mining, and Data Imputation models are used to address these challenges, respectively. In this direction, we focus on the development of a complete methodology for the ATHLOS Project, funded by the European Union's Horizon 2020 Research and Innovation Program, which aims to achieve a better interpretation of the impact of aging on health. The inherent complexity of the provided dataset lies in the fact that the project includes 15 independent European and international longitudinal studies of aging. In this work, we mainly focus on the HealthStatus (HS) score, an index that estimates a person's health status, and examine the effect of various data imputation models on the predictive power of classification and regression models. Our results are promising and indicate that data imputation is critically important for strengthening the role of preventive medicine.
