OpenFlyData: An exemplar data web integrating gene expression data on the fruit fly Drosophila melanogaster

MOTIVATION Integrating heterogeneous data across distributed sources is a major requirement for in silico bioinformatics supporting translational research. For example, genome-scale data on patterns of gene expression in the fruit fly Drosophila melanogaster are widely used in functional genomic studies in many organisms to inform candidate gene selection and validate experimental results. However, current data integration solutions tend to be heavyweight and require significant initial and ongoing investment of effort. Development of a common Web-based data integration infrastructure (a.k.a. data web), using Semantic Web standards, promises to alleviate these difficulties, but little is known about the feasibility, costs, risks or practical means of migrating to such an infrastructure.

RESULTS We describe the development of OpenFlyData, a proof-of-concept system integrating gene expression data on D. melanogaster, combining Semantic Web standards with lightweight approaches to Web programming based on Web 2.0 design patterns. To support researchers designing and validating functional genomic studies, OpenFlyData includes user-facing search applications providing intuitive access to and comparison of gene expression data from FlyAtlas, the BDGP in situ database, and FlyTED, using data from FlyBase to expand and disambiguate gene names. OpenFlyData's services are also openly accessible, and are available for reuse by other bioinformaticians and application developers. Semi-automated methods and tools were developed to support labour- and knowledge-intensive tasks involved in deploying SPARQL services. These include methods for generating ontologies and relational-to-RDF mappings for relational databases, which we illustrate using the FlyBase Chado database schema, and methods for mapping gene identifiers between databases. The advantages of using Semantic Web standards for biomedical data integration are discussed, as are open issues. In particular, although the performance of open source SPARQL implementations is sufficient to query gene expression data directly from user-facing applications such as Web-based data fusions (a.k.a. mashups), we found open SPARQL endpoints to be vulnerable to denial-of-service-type problems, which must be mitigated to ensure reliability of services based on this standard. These results are relevant to data integration activities in translational bioinformatics.

AVAILABILITY The gene expression search applications and SPARQL endpoints developed for OpenFlyData are deployed at http://openflydata.org. FlyUI, a library of JavaScript widgets providing re-usable user-interface components for Drosophila gene expression data, is available at http://flyui.googlecode.com. Software and ontologies to support transformation of data from FlyBase, FlyAtlas, BDGP and FlyTED to RDF are available at http://openflydata.googlecode.com. SPARQLite, an implementation of the SPARQL protocol, is available at http://sparqlite.googlecode.com. All software is provided under the GPL version 3 open source license.
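
As an illustration of what querying a SPARQL endpoint directly from an application involves, the following is a minimal sketch of how a client might build a SPARQL protocol GET request that looks up a gene by its label. It is not taken from OpenFlyData: the endpoint URL (http://example.org/sparql), the use of rdfs:label to carry gene symbols, the example symbol, and the function names are all assumptions made for illustration. The query carries an explicit LIMIT, which keeps the result set bounded on the client side; it is not a substitute for the server-side mitigation that open endpoints need.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real OpenFlyData endpoint paths are not given here.
EXAMPLE_ENDPOINT = "http://example.org/sparql"


def build_gene_lookup_query(symbol: str, limit: int = 10) -> str:
    """Return a SPARQL SELECT query for resources labelled with a gene symbol.

    Assumes (for illustration only) that gene symbols are stored as
    rdfs:label literals; the actual vocabulary may differ.
    """
    # Escape backslashes and double quotes so the symbol stays inside
    # the SPARQL string literal.
    escaped = symbol.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?gene WHERE {\n"
        f'  ?gene rdfs:label "{escaped}" .\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as the 'query' parameter of a SPARQL protocol GET request."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # "aly" is used here only as an example symbol.
    query = build_gene_lookup_query("aly", limit=5)
    print(query)
    print(build_request_url(EXAMPLE_ENDPOINT, query))
```

A browser-based mashup would issue the same kind of request from JavaScript; the sketch only constructs the URL and does not contact any server.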
