Enhancing the Human Health Status Prediction: The ATHLOS Project

Preventive healthcare is a crucial pillar of health, as it helps people stay healthy and receive prompt treatment when needed. Mining knowledge from longitudinal studies has the potential to significantly improve preventive healthcare. Unfortunately, data originating from such studies are characterized by high complexity, huge volume, and a plethora of missing values. Machine Learning, Data Mining, and Data Imputation models are used to address these three challenges, respectively. In this direction, we focus on developing a complete methodology for the ATHLOS (Ageing Trajectories of Health: Longitudinal Opportunities and Synergies) Project, funded by the European Union's Horizon 2020 Research and Innovation Program, which aims to achieve a better interpretation of the impact of aging on health. The inherent complexity of the provided dataset lies in the fact that the project includes 15 independent European and international longitudinal studies of aging. In this work, we focus in particular on the HealthStatus (HS) score, an index that estimates a person's health status, and examine the effect of various data imputation models on the predictive power of classification and regression models. Our results are promising and indicate the critical importance of data imputation in strengthening the role of preventive medicine.
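
To make the idea of comparing imputation models concrete, the following is a minimal sketch and not the ATHLOS methodology itself. The abstract does not name the imputation models that were evaluated, so mean imputation and k-nearest-neighbour imputation are assumed here purely for illustration. The data are synthetic, and every function and variable name is hypothetical. For brevity the sketch scores each method by its error on the masked values, which is a simpler proxy than the downstream prediction of the HS score studied in the work.

```python
import math
import random


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def knn_impute(rows, k=3):
    """Replace each None with the column mean over the k nearest donor rows.

    Distance is computed only over columns observed in both rows.
    """
    def dist(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    filled = []
    for i, r in enumerate(rows):
        new = list(r)
        for j, v in enumerate(r):
            if v is None:
                # Donors are other rows that have column j observed.
                donors = sorted(
                    (dist(r, o), o[j])
                    for m, o in enumerate(rows)
                    if m != i and o[j] is not None
                )[:k]
                new[j] = sum(val for _, val in donors) / len(donors)
        filled.append(new)
    return filled


def rmse(truth, filled, masked, col=1):
    """Root mean square error on the masked entries of one column."""
    errors = [(truth[i][col] - filled[i][col]) ** 2 for i in masked]
    return math.sqrt(sum(errors) / len(errors))


# Synthetic data (an assumption): two correlated variables per record.
random.seed(0)
truth = []
for _ in range(200):
    x1 = random.uniform(0, 10)
    truth.append([x1, 2 * x1 + random.gauss(0, 0.5)])

# Hide 20% of the second variable to simulate missing values.
masked = set(random.sample(range(200), 40))
observed = [[r[0], None if i in masked else r[1]]
            for i, r in enumerate(truth)]

mean_filled = mean_impute(observed)
knn_filled = knn_impute(observed, k=3)

rmse_mean = rmse(truth, mean_filled, masked)
rmse_knn = rmse(truth, knn_filled, masked)
print(f"mean imputation RMSE: {rmse_mean:.3f}")
print(f"kNN imputation RMSE:  {rmse_knn:.3f}")
```

On data like these, where the variables are strongly correlated, a neighbour-based method can exploit the observed variable while mean imputation cannot. In a fuller comparison, the imputed datasets would be passed to the classification and regression models and scored on their prediction of the target.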
