Linked cancer genome atlas database

The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional pilot project to create an atlas of genetic mutations responsible for cancer. One of the aims of this project is to develop an infrastructure for making cancer-related data publicly accessible, so that cancer researchers anywhere in the world can make and validate important discoveries. However, data in TCGA are organized as text archives in a set of directories. Devising bioinformatics applications to analyse such data is still challenging, as it requires downloading very large archives and parsing the relevant text files in order to collect the critical co-variates necessary for analysis. Furthermore, the various types of experimental results are not connected biologically: to truly exploit the data in the genome-wide context in which the TCGA project was devised, the data need to be converted into a structured representation and made publicly available for remote querying and virtual integration. In this work, we address these issues by RDFizing data from TCGA and linking its elements to the Linked Open Data (LOD) Cloud. The outcome is, to the best of our knowledge, the largest LOD data source, comprising over 30 billion triples. This data source can be exploited through publicly available SPARQL endpoints, thus providing an easy-to-use, time-efficient, and scalable solution for accessing TCGA. We also describe showcases enabled by the new linked data representation of TCGA presented in this paper.
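As a rough illustration of what RDFizing a tab-delimited text file can involve, the following minimal sketch turns each row of a small table into N-Triples statements. It is not the conversion pipeline used in this work: the `http://example.org/tcga/` namespace, the column names, the sample records, and the `rdfize` helper are all hypothetical, and row identifiers and column headers are assumed to be safe to embed in an IRI without further encoding.

```python
import csv
import io

# Hypothetical namespace; not the URI scheme used by the authors.
BASE = "http://example.org/tcga/"


def escape_literal(value: str) -> str:
    """Escape a string for use as an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rdfize(tsv_text: str, id_column: str) -> list[str]:
    """Turn each row of a tab-delimited table into N-Triples lines.

    The row identifier becomes the subject, each remaining column
    header becomes a predicate, and each cell becomes a plain literal.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    triples = []
    for row in reader:
        subject = f"<{BASE}record/{row[id_column]}>"
        for column, cell in row.items():
            if column == id_column or cell in ("", None):
                continue  # skip the key column and empty cells
            predicate = f"<{BASE}vocab/{column}>"
            triples.append(f'{subject} {predicate} "{escape_literal(cell)}" .')
    return triples


# Made-up sample records, for illustration only.
sample = "patient_id\tgender\ttumor_site\nP001\tfemale\tbrain\nP002\tmale\t\n"
for line in rdfize(sample, "patient_id"):
    print(line)
```

Once data are expressed as triples like these and loaded into a triple store, they can be queried remotely with SPARQL instead of being downloaded and parsed archive by archive.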
