Harmonising and linking biomedical and clinical data across disparate data archives to enable integrative cross-biobank research

A wealth of biospecimen samples is stored in modern, globally distributed biobanks. Biomedical researchers worldwide need to be able to combine these resources to improve the power of large-scale studies. A prerequisite for this effort is the ability to search and access, in an integrated manner, phenotypic, clinical and other information about the samples currently stored at biobanks. However, privacy issues, heterogeneous information systems and the lack of agreed-upon vocabularies have made searching for specimens across multiple biobanks extremely challenging. We describe three case studies in which we linked samples and sample descriptions to facilitate global searching of samples available for research. The use cases are the ENGAGE (European Network for Genetic and Genomic Epidemiology) consortium, comprising at least 39 cohorts; the SUMMIT (surrogate markers for micro- and macro-vascular hard endpoints for innovative diabetes tools) consortium; and a pilot for data integration between a Swedish clinical health registry and a biobank. We used the Sample avAILability (SAIL) method for data linking: we first created harmonised variables, and then annotated and made searchable the information on the number of specimens available in individual biobanks for various phenotypic categories. By operating on these categorised availability data, we sidestep many of the privacy obstacles that arise when handling real values. We show that harmonised and annotated records of data availability across disparate biomedical archives provide a key methodological advance in the pre-analysis exchange of information between biobanks, that is, during the project planning phase.
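The idea of exchanging categorised availability counts rather than real values can be illustrated with a minimal Python sketch. This is not the SAIL software itself: the biobank names, the local variable names, the BMI cut-offs and the helper functions `bmi_category`, `availability` and `search` are all hypothetical and chosen only for illustration. Each biobank maps its local variable names to harmonised ones, bins values into categories, and publishes only per-category counts, which can then be searched across biobanks.

```python
from collections import Counter

# Hypothetical mapping from each biobank's local variable names to
# harmonised variable names (illustrative only, not taken from SAIL).
HARMONISED_NAMES = {
    "biobank_a": {"bmi_kg_m2": "bmi", "sex_code": "sex"},
    "biobank_b": {"BodyMassIndex": "bmi", "gender": "sex"},
}


def bmi_category(value):
    """Bin a BMI value into a coarse category (illustrative cut-offs)."""
    if value < 25:
        return "<25"
    if value < 30:
        return "25-30"
    return ">=30"


def availability(biobank, records):
    """Count samples per (harmonised variable, category).

    Only counts are returned; the real values never leave this function.
    """
    mapping = HARMONISED_NAMES[biobank]
    counts = Counter()
    for record in records:
        for local_name, value in record.items():
            harmonised = mapping.get(local_name)
            if harmonised is None or value is None:
                continue  # unmapped variable or missing value
            if harmonised == "bmi":
                category = bmi_category(value)
            else:
                category = "available"
            counts[(harmonised, category)] += 1
    return counts


def search(index, variable, category):
    """Return the biobanks holding samples for a variable and category."""
    hits = {}
    for biobank, counts in index.items():
        n = counts[(variable, category)]
        if n > 0:
            hits[biobank] = n
    return hits


# Made-up sample records for two hypothetical biobanks.
records_a = [
    {"bmi_kg_m2": 22.0, "sex_code": 1},
    {"bmi_kg_m2": 31.5, "sex_code": 2},
    {"bmi_kg_m2": None, "sex_code": 1},
]
records_b = [
    {"BodyMassIndex": 27.3, "gender": "F"},
    {"BodyMassIndex": 33.0, "gender": "M"},
]

index = {
    "biobank_a": availability("biobank_a", records_a),
    "biobank_b": availability("biobank_b", records_b),
}

print(search(index, "bmi", ">=30"))
```

In this sketch a researcher planning a study would query `index` to learn which biobanks hold samples in a given phenotypic category and how many, without seeing any individual-level measurement.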
