TopFed: TCGA tailored federated query processing and linking to LOD

Background: The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to catalogue genetic mutations responsible for cancer using genome analysis techniques. One of the aims of this project is to create a comprehensive and open repository of cancer-related molecular analysis, to be exploited by bioinformaticians towards advancing cancer knowledge. However, devising bioinformatics applications to analyse such large datasets is still challenging, as it often requires downloading large archives and parsing the relevant text files. This makes it difficult to enable virtual data integration in order to collect the critical co-variates necessary for analysis.

Methods: We address these issues by transforming the TCGA data into the Semantic Web standard Resource Description Framework (RDF), linking it to relevant datasets in the Linked Open Data (LOD) cloud, and proposing an efficient data distribution strategy to host the resulting 20.4 billion triples via several SPARQL endpoints. With the TCGA data distributed across multiple SPARQL endpoints, we enable biomedical scientists to query and retrieve information from these endpoints through a TCGA-tailored federated SPARQL query processing engine named TopFed.

Results: We compare TopFed with a well-established federation engine, FedX, in terms of source selection and query execution time, using 10 different federated SPARQL queries with varying requirements. Our evaluation results show that TopFed selects on average fewer than half of the sources (with 100% recall), with a query execution time equal to one third of that of FedX.

Conclusion: With TopFed, we aim to offer biomedical scientists a single point of access through which distributed TCGA data can be accessed in unison. We believe the proposed system can greatly help researchers in the biomedical domain to carry out their research effectively with TCGA, as the amount and diversity of data exceeds the ability of local resources to handle its retrieval and parsing.
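
To make the idea of source selection in a federated SPARQL engine concrete, here is a minimal sketch in Python. It is not TopFed's algorithm, which the abstract describes only as tailored to the TCGA data distribution. Instead it shows generic predicate-based selection over an in-memory index, under the assumption that each endpoint advertises the predicates it serves. The endpoint URLs, the predicate names, and the `select_sources` helper are all hypothetical and exist only for illustration.

```python
from typing import Dict, List, Set, Tuple

# A triple pattern is (subject, predicate, object).
# Terms that start with "?" are treated as unbound variables.
TriplePattern = Tuple[str, str, str]


def is_variable(term: str) -> bool:
    """Return True if the term is a SPARQL-style variable such as ?patient."""
    return term.startswith("?")


def select_sources(
    patterns: List[TriplePattern],
    index: Dict[str, Set[str]],
) -> Dict[TriplePattern, List[str]]:
    """Map each triple pattern to the endpoints that may answer it.

    `index` maps an endpoint URL to the set of predicates it serves.
    A pattern with an unbound predicate cannot be pruned, so every
    endpoint is kept for it; this is what preserves full recall.
    """
    selection: Dict[TriplePattern, List[str]] = {}
    for pattern in patterns:
        predicate = pattern[1]
        if is_variable(predicate):
            relevant = sorted(index)
        else:
            relevant = sorted(
                endpoint
                for endpoint, predicates in index.items()
                if predicate in predicates
            )
        selection[pattern] = relevant
    return selection


# Hypothetical endpoints and predicates, made up for this example.
index = {
    "http://example.org/sparql/clinical": {"ex:patientBarcode", "ex:tumourType"},
    "http://example.org/sparql/methylation": {"ex:patientBarcode", "ex:betaValue"},
    "http://example.org/sparql/expression": {"ex:rpkm"},
}

query = [
    ("?record", "ex:patientBarcode", "?barcode"),
    ("?record", "ex:betaValue", "?beta"),
    ("?record", "?anyPredicate", "?anyValue"),
]

selection = select_sources(query, index)
for pattern, endpoints in selection.items():
    print(pattern, "->", endpoints)
```

In this sketch the second pattern is sent to a single endpoint instead of all three, which is the kind of reduction in contacted sources that the Results paragraph measures. The third pattern, with an unbound predicate, still goes to every endpoint so that no answers are lost.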
