Optimizing the synthesis of clinical trial data using sequential trees

Abstract

Objective: With the growing demand for sharing clinical trial data, scalable methods that enable privacy-protective access to high-utility data are needed. Data synthesis is one such method. Sequential trees are commonly used to synthesize health data. It is hypothesized that the utility of the generated data depends on the variable order. No assessment of the impact of variable order on synthesized clinical trial data has been performed thus far. Through simulation, we aim to evaluate the variability in the utility of synthetic clinical trial data as the variable order is randomly shuffled, and to implement an optimization algorithm to find a good order if the variability is too high.

Materials and Methods: Six oncology clinical trial datasets were evaluated in a simulation. Three utility metrics were computed comparing real and synthetic data: univariate similarity, similarity in multivariate prediction accuracy, and a distinguishability metric. Particle swarm optimization was implemented to optimize variable order, and was compared with a curriculum learning approach to ordering variables.

Results: As the number of variables in a clinical trial dataset increases, the variability of data utility across variable orders increases markedly. Particle swarm optimization with a distinguishability hinge loss ensured adequate utility across all six datasets. The hinge threshold was selected to avoid overfitting, which can create a privacy problem. This approach was superior to curriculum learning in terms of utility.

Conclusions: The optimization approach presented in this study gives a reliable way to synthesize high-utility clinical trial datasets.
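The role of the hinge threshold can be illustrated with a minimal sketch. It is not the study's implementation: the threshold value of 0.05, the `toy_distinguishability` scoring function, and the plain random search (used here in place of particle swarm to keep the example short) are all assumptions made for illustration. The point it shows is that the loss is zero once distinguishability falls to the threshold, so the search has no incentive to keep pushing the synthetic data closer to the real data.

```python
import random
from typing import Callable, List, Sequence, Tuple


def hinge_loss(distinguishability: float, threshold: float) -> float:
    """Zero at or below the threshold, linear above it."""
    return max(0.0, distinguishability - threshold)


def search_order(
    variables: Sequence[str],
    score: Callable[[List[str]], float],
    threshold: float,
    n_trials: int = 50,
    seed: int = 0,
) -> Tuple[List[str], float]:
    """Random search over variable orders (a stand-in for particle swarm).

    `score` is a hypothetical callable returning a distinguishability
    value for synthetic data generated with the given variable order.
    """
    rng = random.Random(seed)
    best_order = list(variables)
    best_loss = hinge_loss(score(best_order), threshold)
    for _ in range(n_trials):
        if best_loss == 0.0:
            break  # good enough: stop instead of optimizing further
        candidate = list(variables)
        rng.shuffle(candidate)
        loss = hinge_loss(score(candidate), threshold)
        if loss < best_loss:
            best_order, best_loss = candidate, loss
    return best_order, best_loss


def toy_distinguishability(order: List[str]) -> float:
    """Made-up score: 0.05 per variable that is out of alphabetical position."""
    target = sorted(order)
    return 0.05 * sum(a != b for a, b in zip(order, target))


if __name__ == "__main__":
    variables = ["stage", "age", "sex", "arm"]  # hypothetical variable names
    order, loss = search_order(variables, toy_distinguishability, threshold=0.05)
    print(order, loss)
```

In a real setting, `score` would synthesize the dataset with the candidate order and return the measured distinguishability between real and synthetic records.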
