A community-driven GWAS summary statistics standard

Summary statistics from genome-wide association studies (GWAS) hold great potential for research. However, accessing and sharing these data is difficult because there are no standards for data content or file format. For this reason, the GWAS Catalog hosted a series of meetings in 2021 with summary statistics stakeholders to guide the development of a standard format. The stakeholders' key requirements were that the standard should contain the data elements needed to support a wide range of analyses, demand little bioinformatics expertise to read and generate files, provide easily accessible metadata, and yield unambiguous, interoperable data. Here, we define the specifications for the first version of the GWAS-SSF format, which was developed to meet the requirements discussed with the community. GWAS-SSF consists of a tab-separated data file with well-defined fields and an accompanying metadata file.
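As a rough illustration of the low barrier to access that a tab-separated file offers, the following minimal Python sketch reads such a file using only the standard library. The column names and values are assumptions chosen for illustration; they are not field definitions taken from the text above, and the specification itself should be consulted for the actual fields.

```python
import csv
import io

# Hypothetical sample data. The column names and values below are
# illustrative assumptions, not definitions from the specification.
SAMPLE_TSV = (
    "chromosome\tbase_pair_location\teffect_allele\tother_allele\t"
    "beta\tstandard_error\tp_value\n"
    "1\t10511\tA\tG\t0.012\t0.004\t0.0027\n"
    "2\t20345\tT\tC\t-0.030\t0.010\t0.0027\n"
)


def read_sumstats(handle):
    """Parse a tab-separated summary statistics table into typed rows."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for record in reader:
        rows.append(
            {
                "chromosome": record["chromosome"],
                "base_pair_location": int(record["base_pair_location"]),
                "effect_allele": record["effect_allele"],
                "other_allele": record["other_allele"],
                "beta": float(record["beta"]),
                "standard_error": float(record["standard_error"]),
                "p_value": float(record["p_value"]),
            }
        )
    return rows


# An in-memory string stands in for a file opened with open(path, newline="").
rows = read_sumstats(io.StringIO(SAMPLE_TSV))
for row in rows:
    print(row["chromosome"], row["base_pair_location"], row["p_value"])
```

Because the data file is plain tab-separated text with a header row, no specialised parser is needed; the accompanying metadata file would be read separately.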
