Trecode: a FAIR eco-system for the analysis and archiving of omics data in a combined diagnostic and research setting

Motivation: The increase in speed, reliability and cost-effectiveness of high-throughput sequencing has led to the widespread clinical application of genome (WGS), exome (WXS) and transcriptome analysis. WXS and RNA sequencing are now being implemented as standard of care for patients, including those enrolled in clinical studies. To keep track of sample relationships and analyses, a platform is needed that can unify metadata for diverse sequencing strategies with sample metadata whilst supporting automated and reproducible analyses, in essence ensuring that analysis is conducted consistently and that data are Findable, Accessible, Interoperable and Reusable (FAIR).

Results: We present “Trecode”, a framework that records both clinical and research sample (meta)data and manages the computational genome analysis workflows executed in both settings, thereby achieving tight integration between analysis results and sample metadata. With complete, consistent and FAIR (meta)data management in a single platform, stacked bioinformatic analyses are performed automatically and tracked by the database, ensuring data provenance, reproducibility and reusability, which are key in worldwide collaborative translational research.

Availability and implementation: The Trecode data model, codebooks, NGS workflows and client programs are currently being cleared of local compute infrastructure dependencies and will become publicly available in spring 2021.

Contact: p.kemmeren@prinsesmaximacentrum.nl
