The Problem of Fairness in Synthetic Healthcare Data

Access to healthcare data such as electronic health records (EHR) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically generated healthcare data can address this problem by preserving privacy while letting researchers and policymakers base decisions and methods on realistic data. Healthcare data can include records of multiple inpatient and outpatient visits per patient, making them time-series data that are often influenced by protected attributes such as age, gender, and race. The COVID-19 pandemic has exacerbated health inequities, with certain subgroups experiencing poorer outcomes and less access to healthcare. To combat these inequities, synthetic data must “fairly” represent diverse minority subgroups, so that conclusions drawn from synthetic data are correct and generalize to real data. In this article, we develop two fairness metrics for synthetic data and apply them to all subgroups defined by protected attributes to quantify the bias in three published synthetic research datasets. These covariate-level disparity metrics reveal that synthetic data may not be representative at the univariate and multivariate subgroup levels; fairness should therefore be addressed when developing data generation methods. We discuss the need for measuring fairness in synthetic healthcare data to enable the development of robust machine learning models that create more equitable synthetic healthcare datasets.
