Recovery of original individual person data (IPD) inferences from empirical IPD summaries only: Applications to distributed computing under disclosure constraints

In many settings individual person data (IPD) are unavailable for privacy or technical reasons, and one must work with IPD proxies, such as summary statistics, to approximate the original IPD inferences, that is, the results of the statistical analyses that would ideally have been performed on the individual-level data. For instance, in a distributed computing setting such as the one implemented in the DataSHIELD software framework, centers can share only IPD proxies to obtain pooled IPD inferences. Such privacy requirements limit the scope of statistical investigation; for example, fitting regression models with between-center random effects can be challenging. To increase modeling freedom, we propose a method that takes as input only simple nondisclosive summaries of the original IPD, such as empirical marginal moments and correlation matrices, and generates artificial data compatible with those summaries. Specifically, the data are generated from a Gaussian copula whose marginal and joint components are specified by these summaries. The goal is to reproduce the features of the original IPD in the artificial data, so that the original IPD inferences can be recovered from the artificial data. In an application example and in simulations, we show that estimates from a multivariable IPD random-effects logistic regression can be recovered from artificial data generated via the Gaussian copula using these IPD summaries. This suggests that the proposed approach provides a generally applicable strategy for distributed computing settings with data protection constraints.
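
The generation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes only two kinds of marginals (a normal marginal given by a mean and standard deviation, and a binary marginal given by a proportion), and it uses the empirical correlation matrix directly as the correlation of the latent Gaussian vector, which is only an approximation because the correlations are attenuated when a margin is discretized. All names (`generate_artificial_data`, `inverse_cdf`, the `marginals` layout) and the example summary values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def inverse_cdf(marginal, u):
    """Map a uniform value u in (0, 1) to the scale of one variable."""
    if marginal["kind"] == "binary":
        # Bernoulli(p): the upper p fraction of the uniform scale maps to 1.
        return 1 if u > 1.0 - marginal["p"] else 0
    if marginal["kind"] == "normal":
        # Assumption: the continuous marginal is normal with the shared moments.
        return NormalDist(marginal["mean"], marginal["sd"]).inv_cdf(u)
    raise ValueError("unsupported marginal kind: " + str(marginal["kind"]))


def generate_artificial_data(marginals, corr, n_rows, seed=0):
    """Draw rows from a Gaussian copula specified by marginals and a correlation matrix."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    lower = cholesky(corr)
    dim = len(marginals)
    rows = []
    for _ in range(n_rows):
        # Independent standard normals, then correlate them via the Cholesky factor.
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        z = [sum(lower[i][k] * eps[k] for k in range(i + 1)) for i in range(dim)]
        # Transform to uniforms; clamp away from 0 and 1 so inv_cdf stays defined.
        u = [min(max(std_normal.cdf(v), 1e-12), 1.0 - 1e-12) for v in z]
        rows.append([inverse_cdf(m, ui) for m, ui in zip(marginals, u)])
    return rows


# Hypothetical summaries that a center might share instead of its IPD.
marginals = [
    {"kind": "normal", "mean": 60.0, "sd": 10.0},  # e.g. a continuous covariate
    {"kind": "binary", "p": 0.3},                  # e.g. a binary outcome
]
corr = [[1.0, 0.4], [0.4, 1.0]]

artificial = generate_artificial_data(marginals, corr, n_rows=5000, seed=1)
```

In a multi-center setting, one such artificial data set would be generated per center from that center's summaries, and the pooled analysis (for example, a random-effects logistic regression) would then be fitted to the stacked artificial data.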
