Smart Persistence and Accessibility of Genomic and Clinical Data

The continuous growth of experimental data generated by Next Generation Sequencing (NGS) machines has led to the adoption of advanced techniques to manage them intelligently. The advent of the Big Data era posed new challenges that led to the development of novel methods and tools, which were originally designed to address problems in computational science but can now be widely applied to biomedical data. In this work, we address two biomedical data management issues: (i) how to reduce the redundancy of genomic and clinical data, and (ii) how to make this large amount of data easily accessible. First, we propose an approach to organizing genomic and clinical data that takes data redundancy into account, together with a method that saves as much storage space as possible by exploiting NoSQL technologies. Then, we propose design principles for organizing biomedical data and making them easily accessible through a collection of Application Programming Interfaces (APIs), which together provide a flexible framework that we call OpenOmics. To demonstrate the validity of our approach, we apply it to data extracted from the Genomic Data Commons repository. OpenOmics is free and open source, so that anyone can extend the set of provided APIs with new features that answer specific biological questions. The APIs are hosted on GitHub at https://github.com/fabio-cumbo/open-omics-api/, are publicly queryable at http://bioinformatics.iasi.cnr.it/openomics/api/routes, and are documented at https://openomics.docs.apiary.io/.
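
As a rough illustration of the redundancy-reduction idea in (i), the sketch below stores fields that repeat across many records (here, gene annotations) only once and lets each per-sample document refer to them by a key. This is not the OpenOmics schema: the function `split_redundant_fields`, the field names, the `annotation_ref` key, and the toy records are all hypothetical and chosen only for illustration.

```python
import hashlib
import json


def split_redundant_fields(records, shared_keys):
    """Move fields that repeat across records into a shared lookup table.

    Hypothetical helper: returns (shared, slim), where `shared` maps a digest
    to the sub-document stored once, and `slim` holds the per-sample documents
    that reference it through an "annotation_ref" field.
    """
    shared = {}
    slim = []
    for record in records:
        common = {key: record[key] for key in shared_keys}
        # A stable digest of the shared part serves as its identifier.
        payload = json.dumps(common, sort_keys=True).encode("utf-8")
        digest = hashlib.sha1(payload).hexdigest()
        shared.setdefault(digest, common)
        rest = {k: v for k, v in record.items() if k not in shared_keys}
        rest["annotation_ref"] = digest
        slim.append(rest)
    return shared, slim


# Toy records (made up): the gene annotation repeats for every sample.
records = [
    {"sample": "S1", "gene": "TP53", "chrom": "chr17", "value": 0.42},
    {"sample": "S2", "gene": "TP53", "chrom": "chr17", "value": 0.57},
    {"sample": "S1", "gene": "BRCA1", "chrom": "chr17", "value": 0.10},
]
shared, slim = split_redundant_fields(records, shared_keys=("gene", "chrom"))
```

In a document-oriented store, `shared` and `slim` would typically live in two separate collections, and a query would join them on the reference key; whether this saves space in practice depends on how large the repeated part is relative to the reference.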
