Exposing The Cancer Genome Atlas as a SPARQL endpoint

The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to characterize several types of cancer. Datasets from biomedical domains such as TCGA present a particularly challenging task for those interested in dynamically aggregating their results, because the data sources are typically both heterogeneous and distributed. The Linked Data best practices offer a way to integrate and discover data with those characteristics, namely by exposing data as Web services that support SPARQL, the query language of the Resource Description Framework (RDF). Most SPARQL endpoints, however, cannot easily be queried by data experts. Furthermore, exposing experimental data as SPARQL endpoints remains challenging because, in most cases, the data must first be converted to RDF triples. To address those requirements, we have developed an infrastructure that exposes the clinical, demographic and molecular data elements generated by TCGA as a SPARQL endpoint by assigning the elements to entities of the Simple Sloppy Semantic Database (S3DB) management model. All components of the infrastructure are available as independent Representational State Transfer (REST) Web services to encourage reusability, and a simple interface was developed to assemble SPARQL queries automatically by navigating a representation of the TCGA domain. A key feature of the proposed solution, and one that greatly facilitates the assembly of SPARQL queries, is the distinction between the TCGA domain descriptors and the data elements. Furthermore, using the S3DB management model as a mediator enables queries over both public and protected data without prior submission to a single data source.
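
As an illustration of what assembling a SPARQL query from a selection of domain descriptors might look like, the following minimal sketch builds a SELECT query from a list of triple patterns. It is not the interface described above: the namespace `http://example.org/tcga#`, the property names `hasAge` and `hasSample`, and the function `assemble_query` are all hypothetical placeholders, since the actual TCGA and S3DB identifiers are not given here.

```python
from typing import List, Tuple

# Hypothetical namespace; the real TCGA/S3DB identifiers are not given in the text.
EX = "http://example.org/tcga#"

# A triple pattern as (subject variable, property local name, object variable).
Triple = Tuple[str, str, str]


def assemble_query(select_vars: List[str], patterns: List[Triple], limit: int = 10) -> str:
    """Assemble a SPARQL SELECT query string from variables and triple patterns."""
    head = "SELECT " + " ".join(f"?{v}" for v in select_vars)
    body = "\n".join(f"  ?{s} <{EX}{p}> ?{o} ." for s, p, o in patterns)
    return f"{head}\nWHERE {{\n{body}\n}}\nLIMIT {limit}"


# Assumed example selection: a patient's age plus any associated sample.
query = assemble_query(
    ["patient", "age"],
    [("patient", "hasAge", "age"), ("patient", "hasSample", "sample")],
)
print(query)
```

The resulting string could then be submitted to any SPARQL endpoint. Keeping the descriptors (here, the property names) separate from the data elements they describe is what lets the query be built by navigation rather than written by hand.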
