Data harmonization and federated analysis of population-based studies: the BioSHaRE project

Abstract

Background: Individual-level data pooling of large population-based studies across research centres in international research projects faces many hurdles. The BioSHaRE (Biobank Standardisation and Harmonisation for Research Excellence in the European Union) project aims to address these issues by building a collaborative group of investigators and developing tools for data harmonization, database integration and federated data analyses.

Methods: Eight population-based studies in six European countries were recruited to participate in the BioSHaRE project. Through workshops, teleconferences and electronic communications, participating investigators identified a set of 96 variables targeted for harmonization to answer research questions of interest. Harmonization potential was assessed using each study's questionnaires, standard operating procedures, and data dictionaries. Whenever harmonization was deemed possible, processing algorithms were developed and implemented in an open-source software infrastructure to transform study-specific data into the target (i.e. harmonized) format. Harmonized datasets, located on servers at each research centre across Europe, were interconnected through a federated database system to perform statistical analysis.

Results: Retrospective harmonization led to the generation of common-format variables for 73% of the matches considered (96 targeted variables across 8 studies). Authenticated investigators can now use the DataSHIELD method to perform complex statistical analyses of harmonized datasets stored on distributed servers without actually sharing individual-level data.

Conclusion: New Internet-based networking technologies and database management systems are providing the means to support collaborative, multi-center research in an efficient and secure manner. The results from this pilot project show that, given a strong collaborative relationship between participating studies, it is possible to seamlessly co-analyse internationally harmonized research databases while allowing each study to retain full control over individual-level data. We encourage additional collaborative research networks in epidemiology, public health, and the social sciences to make use of the open-source tools presented herein.
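
The two steps described in the Methods (transforming study-specific data into a common format, then analysing it without moving individual-level records) can be illustrated with a minimal Python sketch. It is not the BioSHaRE software or the DataSHIELD API: the study layouts, variable names, and function names below are hypothetical, and a simple one-predictor linear fit stands in for the more complex analyses the abstract mentions. Each site harmonizes its own rows and releases only aggregate sums; a coordinator combines those sums to obtain the same estimate a pooled analysis would give.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, float]


@dataclass(frozen=True)
class LocalSummary:
    """Aggregate statistics released by one site (no individual rows)."""
    n: int
    sum_x: float
    sum_y: float
    sum_xx: float
    sum_xy: float


def harmonize_study_a(raw: Row) -> Row:
    # Hypothetical study A: age in years, height in centimetres.
    height_m = raw["height_cm"] / 100.0
    return {"age": raw["age_years"], "bmi": raw["weight_kg"] / height_m ** 2}


def harmonize_study_b(raw: Row) -> Row:
    # Hypothetical study B: age in months, height in metres.
    return {
        "age": raw["age_months"] / 12.0,
        "bmi": raw["weight_kg"] / raw["height_m"] ** 2,
    }


def local_summary(
    raw_rows: Iterable[Row],
    harmonize: Callable[[Row], Row],
    x: str,
    y: str,
) -> LocalSummary:
    """Runs at the study site: harmonize rows, then return only sums."""
    rows = [harmonize(r) for r in raw_rows]
    xs = [r[x] for r in rows]
    ys = [r[y] for r in rows]
    return LocalSummary(
        n=len(rows),
        sum_x=sum(xs),
        sum_y=sum(ys),
        sum_xx=sum(v * v for v in xs),
        sum_xy=sum(a * b for a, b in zip(xs, ys)),
    )


def pooled_linear_fit(summaries: List[LocalSummary]) -> Tuple[float, float]:
    """Runs at the coordinator: least-squares slope and intercept of y on x."""
    n = sum(s.n for s in summaries)
    sx = sum(s.sum_x for s in summaries)
    sy = sum(s.sum_y for s in summaries)
    sxx = sum(s.sum_xx for s in summaries)
    sxy = sum(s.sum_xy for s in summaries)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("x has no variance across the pooled sites")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    site_a = [
        {"age_years": 40.0, "height_cm": 170.0, "weight_kg": 70.0},
        {"age_years": 55.0, "height_cm": 165.0, "weight_kg": 72.0},
    ]
    site_b = [
        {"age_months": 600.0, "height_m": 1.80, "weight_kg": 85.0},
        {"age_months": 720.0, "height_m": 1.75, "weight_kg": 88.0},
    ]
    summaries = [
        local_summary(site_a, harmonize_study_a, "age", "bmi"),
        local_summary(site_b, harmonize_study_b, "age", "bmi"),
    ]
    print(pooled_linear_fit(summaries))
```

Because the least-squares estimate depends on the data only through these sums, combining the per-site summaries gives the same slope and intercept as fitting the model to the concatenated rows. A production system would also need authentication and disclosure checks before releasing any summary, which this sketch omits.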
