Recommendations for the FAIRification of genomic track metadata

Background: Many types of data from genomic analyses can be represented as genomic tracks, i.e. features linked to the genomic coordinates of a reference genome. Examples of such data are epigenetic DNA methylation data, ChIP-seq peaks, germline or somatic DNA variants, as well as RNA-seq expression levels. Researchers often face difficulties in locating, accessing and combining relevant tracks from external sources, as well as locating the raw data, reducing the value of the generated information.

Description of work: We propose to advance the application of FAIR data principles (Findable, Accessible, Interoperable, and Reusable) to produce searchable metadata for genomic tracks. Findability and Accessibility of metadata can then be ensured by a track search service that integrates globally identifiable metadata from various track hubs in the Track Hub Registry and other relevant repositories. Interoperability and Reusability need to be ensured by the specification and implementation of a basic set of recommendations for metadata. We have tested this concept by developing such a specification in a JSON Schema, called FAIRtracks, and have integrated it into a novel track search service, called TrackFind. We demonstrate practical usage by importing datasets through TrackFind into existing examples of relevant analytical tools for genomic tracks: EPICO and the GSuite HyperBrowser.

Conclusion: We here provide a first iteration of a draft standard for genomic track metadata, as well as the accompanying software ecosystem. It can easily be adapted or extended to future needs of the research community regarding data, methods and tools, balancing the requirements of both data submitters and analytical end-users.
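
As a rough illustration of the idea described above, in which track metadata is checked against a basic set of requirements and then made searchable, the following minimal Python sketch may help. It is not the FAIRtracks schema or the TrackFind API: the field names (`track_id`, `genome_assembly`, `file_url`, `experiment_type`), the helper functions and the example URLs are all hypothetical, chosen only to show the general pattern.

```python
# Hypothetical, simplified stand-in for a track metadata specification.
# Field names are illustrative assumptions, NOT the FAIRtracks schema.
REQUIRED_FIELDS = {
    "track_id": str,         # globally unique identifier (assumed name)
    "genome_assembly": str,  # reference genome build, e.g. "GRCh38"
    "file_url": str,         # location where the track file can be accessed
    "experiment_type": str,  # e.g. "ChIP-seq", "RNA-seq"
}


def validate_track(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems


def search_tracks(records, **criteria):
    """Return valid records whose fields equal all of the given criteria."""
    return [
        r for r in records
        if not validate_track(r)
        and all(r.get(k) == v for k, v in criteria.items())
    ]


# Made-up example records; the third lacks a genome assembly.
tracks = [
    {"track_id": "example:001", "genome_assembly": "GRCh38",
     "file_url": "https://example.org/a.bigWig", "experiment_type": "ChIP-seq"},
    {"track_id": "example:002", "genome_assembly": "GRCh37",
     "file_url": "https://example.org/b.bigBed", "experiment_type": "RNA-seq"},
    {"track_id": "example:003", "file_url": "https://example.org/c.bed",
     "experiment_type": "ChIP-seq"},  # fails validation
]

hits = search_tracks(tracks, experiment_type="ChIP-seq")
print([t["track_id"] for t in hits])  # prints ['example:001']
```

In this sketch the record without a genome assembly is excluded from search results, which mirrors the general point that tracks can only be reliably found and combined when a basic set of metadata is present.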
