Synthetic Patient Data Generation and Evaluation in Disease Prediction Using Small and Imbalanced Datasets

The increasing prevalence of chronic non-communicable diseases makes it a priority to develop tools that improve their management. In this context, Artificial Intelligence algorithms have proven successful in early diagnosis, prediction and analysis in the medical field. Nonetheless, two main issues arise when dealing with medical data: the scarcity of high-fidelity datasets and the need to preserve patient privacy. Synthetic data generation techniques have emerged as a possible solution to both problems. In this work, a framework based on synthetic data generation algorithms was developed and tested on eight medical datasets containing tabular data. Three statistical metrics were used to analyze how well the synthetic data preserve the integrity of the real data, and six synthetic data generation sizes were tested. In addition, the generated synthetic datasets were used to train four supervised Machine Learning classifiers, both alone and in combination with the real data. The F1-score was used to evaluate classification performance. The main goal of this work is to assess the feasibility of synthetic data generation for medical data in two respects: preservation of data integrity and maintenance of classification performance.
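As a brief illustration of why the F1-score suits imbalanced data, the sketch below computes it from its standard definition, F1 = 2PR / (P + R), where P is precision and R is recall on the positive (disease) class. The function names and the toy labels are hypothetical and chosen only for illustration; they are not taken from the framework described here.

```python
# Minimal sketch: F1-score versus accuracy on an imbalanced toy label set.
# The labels below are made-up illustration data, not results from the study.

def f1_score(y_true, y_pred, positive=1):
    """Return the F1-score for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so the score is reported as 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Return the fraction of labels predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Ten hypothetical patients, only two of them positive (imbalanced classes).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A trivial model that always predicts the majority (negative) class.
always_negative = [0] * 10

# A model that finds one of the two positives and raises one false alarm.
partial = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

for name, y_pred in [("always negative", always_negative), ("partial", partial)]:
    print(name, "accuracy:", accuracy(y_true, y_pred), "F1:", f1_score(y_true, y_pred))
```

Both toy models reach the same accuracy, yet the one that never detects a positive case scores an F1 of zero. This is the kind of distinction that matters when a classifier is trained on small, imbalanced medical datasets.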
