DataSHIELD: taking the analysis to the data, not the data to the analysis

Background: Research in modern biomedicine and social science requires sample sizes so large that they can often only be achieved through a pooled co-analysis of data from several studies. But pooling information about individuals in a central database that researchers can query raises important ethico-legal questions and can be controversial. In the UK this has been highlighted by recent debate over the proposed ‘care.data’ initiative, and these issues reflect important societal and professional concerns about privacy, confidentiality and intellectual property. DataSHIELD provides a novel technological solution that can circumvent some of the most basic challenges in giving researchers and other healthcare professionals access to individual-level data.
Methods: Commands are sent from a central analysis computer (AC) to several data computers (DCs) storing the data to be co-analysed. The data sets are analysed simultaneously and in parallel. The separate parallelized analyses are linked by non-disclosive summary statistics and commands transmitted back and forth between the DCs and the AC. This paper describes the technical implementation of DataSHIELD using a modified R statistical environment linked to an Opal database deployed behind the computer firewall of each DC. Analysis is controlled through a standard R environment at the AC.
Results: Based on this Opal/R implementation, DataSHIELD is currently used by the Healthy Obese Project and the Environmental Core Project (BioSHaRE-EU) for the federated analysis of 10 data sets across eight European countries, and this illustrates the opportunities and challenges presented by the DataSHIELD approach.
Conclusions: DataSHIELD facilitates important research in settings where: (i) a co-analysis of individual-level data from several studies is scientifically necessary but governance restrictions prohibit the release or sharing of some of the required data, and/or render data access unacceptably slow; (ii) a research group (e.g. in a developing nation) is particularly vulnerable to loss of intellectual property, in that the researchers want to fully share the information held in their data with national and international collaborators but do not wish to hand over the physical data themselves; and (iii) a data set is to be included in an individual-level co-analysis but the physical size of the data precludes direct transfer to a new site for analysis.
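
Illustration: The AC/DC arrangement described in the Methods can be sketched in a few lines of Python. This is only a toy under stated assumptions and is not the DataSHIELD or Opal API: the class `DataComputer`, the function `pooled_mean_variance` and the threshold `MIN_COUNT` are hypothetical names introduced here, and the real implementation runs in R. Each DC releases only a count, a sum and a sum of squares, and the AC combines these into the mean and variance that a pooled analysis of the individual-level data would have given.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical disclosure threshold (an assumption, not a DataSHIELD setting):
# a DC refuses to summarise fewer records than this.
MIN_COUNT = 5


@dataclass
class Summary:
    """Aggregate statistics that leave a data computer."""
    n: int
    total: float
    total_sq: float


class DataComputer:
    """Holds individual-level data and releases only aggregate summaries."""

    def __init__(self, values: Sequence[float]) -> None:
        self._values = list(values)  # never returned to the caller

    def summarise(self) -> Summary:
        n = len(self._values)
        if n < MIN_COUNT:
            raise ValueError("too few records to release a summary")
        return Summary(n, sum(self._values), sum(v * v for v in self._values))


def pooled_mean_variance(summaries: List[Summary]) -> Tuple[float, float]:
    """Run at the analysis computer: combine per-study summaries."""
    n = sum(s.n for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    # Sample variance from pooled sums: (sum of x^2 - n * mean^2) / (n - 1)
    variance = (total_sq - n * mean ** 2) / (n - 1)
    return mean, variance


# Two hypothetical studies; the AC sees only their summaries.
dcs = [
    DataComputer([1.0, 2.0, 3.0, 4.0, 5.0]),
    DataComputer([6.0, 7.0, 8.0, 9.0, 10.0]),
]
mean, variance = pooled_mean_variance([dc.summarise() for dc in dcs])
```

The sum-of-squares formula is used here for brevity; it can lose precision when values are large relative to their spread, and iterative model fitting would need repeated rounds of such exchanges rather than a single one.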

Oliver W Butters | Markus Perola | Vincent Ferretti | Bartha M Knoppers | Chris Dibben | Melanie Waldenberger | Jennifer R Harris | Ronald P Stolk | Susan E Wallace | Edwin van den Heuvel | Andrew Turner | Paul R Burton | Tero Hiekkalinna | Frank Popham | Isabelle Budin-Ljøsne | Isabel Fortier | Kim W Carter | Kristian Hveem | Mathieu Boniol | Julia Isaeva | Dany Doiron | Gillian Raab | Nuala Sheehan | Kirsti Kvaløy | John Macleod | Marja-Liisa Nuotio | Philippe Laflamme | Eva Reischl | Ivan J Perry | Richard W Francis | Sean Millar | Yannick Marcon | Madeleine J Murtagh | Amadou Gaye | Joel T Minion | Andrew W Boyd | Ipek Demir | Rebecca C Wilson | Elinor M Jones | Christopher J Newby | Barnaby Murtagh | Lisette Giepmans | Carsten Oliver Schmidt | Paolo Boffetta | Maria Bota | Nick deKlerk | Annette Peters | Catherine M Phillips | Bruce HR Woffenbuttel
