Quality, quantity and harmony: the DataSHaPER approach to integrating data across bioclinical studies

Background: Vast sample sizes are often essential in the quest to disentangle the complex interplay of the genetic, lifestyle, environmental and social factors that determine the aetiology and progression of chronic diseases. The pooling of information between studies is therefore of central importance to contemporary bioscience. However, there are many technical, ethico-legal and scientific challenges to be overcome if an effective, valid, pooled analysis is to be achieved. Perhaps most critically, any data that are to be analysed in this way must be adequately ‘harmonized’. This implies that the collection and recording of information and data must be done in a manner that is sufficiently similar in the different studies to allow valid synthesis to take place.

Methods: This conceptual article describes the origins, purpose and scientific foundations of the DataSHaPER (DataSchema and Harmonization Platform for Epidemiological Research; http://www.datashaper.org), which has been created by a multidisciplinary consortium of experts that was pulled together and coordinated by three international organizations: P3G (Public Population Project in Genomics), PHOEBE (Promoting Harmonization of Epidemiological Biobanks in Europe) and CPT (Canadian Partnership for Tomorrow Project).

Results: The DataSHaPER provides a flexible, structured approach to the harmonization and pooling of information between studies. Its two primary components, the ‘DataSchema’ and ‘Harmonization Platforms’, together support the preparation of effective data-collection protocols and provide a central reference to facilitate harmonization. The DataSHaPER supports both ‘prospective’ and ‘retrospective’ harmonization.

Conclusion: It is hoped that this article will encourage readers to investigate the project further: the more the research groups and studies are actively involved, the more effective the DataSHaPER programme will ultimately be.
