Linked Environment Data

Maria Rüther, Joachim Fock, Thomas Bandholtz, Till Schulte-Coerne
(Draft, June 2010)

Several projects at the German Federal Environment Agency (UBA) are currently beginning to design and implement a public data network that is technologically based on Linked Data [1]. The first ones will be the Environmental Specimen Bank (ESB) and the Semantic Network Service (SNS); the inclusion of the Dioxin Database and the Joint Substance Data Pool of the German Federal Government and the German Federal States (GSBL) is still under discussion. The undertaking is part of an international cooperation in the Ecoterm Initiative, and it is envisioned to include the partners of the International Environmental Specimen Bank Group (IESB) [2]. These projects and partners provide the key instruments in the field of environmental observation that enable the long-term analysis of substance exposure of humans and the environment.

1. Linked Data and Environmental Informatics

Since the 1990s, linking environmental data and technical vocabularies has been one of the UBA's main goals, pursued over several project generations (UMPLIS, UDK, GEIN, SNS, PortalU). All previous efforts, however, share two drawbacks:

- Up to now, only data containers (databases, information systems, complex Web pages) have been linked together, not individual data records.
- There is no common access to a shared data structure, so each cross-reference ends at the doors of the referenced data store or, in the best case, at a Web page describing the access.

Linked Data is different: it is a network of individual data elements linked together for direct access and navigation on the Web. The linking mechanism is based on Web addresses (HTTP URIs) for each data element and on the universal data model of the Resource Description Framework (RDF).

[1] http://linkeddata.org/
[2] http://www.inter-esb.org/

In 2006, Tim Berners-Lee formulated four “Linked Data Principles” [3]:

1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4. Include links to other URIs, so that they can discover more things.

Although some of this wording is debatable, the Semantic Web community has published billions of RDF triples and links in the following years. Starting from any point in this part of the Web, one can easily “discover more things” click by click. The “Linked Data Cloud” (Figure 1) has become a huge playground for the exploration of this technology.

Figure 1 Linked Data Cloud (from [4])

All this may be seen as yet another example of (more or less academic) community enthusiasm around DBpedia, the Linked Data representation of Wikipedia. However, there are also more serious, scientific efforts. The most elaborate example is the Linking Open Drug Data (LODD) [5] sub-cloud in the eHealth community.

[3] Tim Berners-Lee, 2006-07-27, http://www.w3.org/DesignIssues/LinkedData.html
[4] http://richard.cyganiak.de/2007/10/lod/

Figure 2 Linking Open Drug Data (LODD)

2. Linked Environment Data

The LODD example gave birth to the idea of linking environmental data in an international context of cooperating governmental authorities. This idea was discussed at Workshop V (Ecoterm 2009) of the Ecoterm Group [6], with members from many European countries and the US.

Ecoterm fosters “a federated approach to accessing terminology and knowledge organization systems in the area of the environment that would allow them to be accessed, interchanged, and used in traditional indexing and search approaches, as well as semantic web applications. The idea is to share the content of these rich resources in such a way that duplication of effort can be avoided and interchange and integration of various structured and unstructured data can be enhanced.
The approach should allow the vocabularies to be linked over time, as appropriate, and for resources to be linked to these vocabularies.” (Ecoterm 2009)

The focus is clearly on setting up reference vocabularies. Participating authorities have started setting up a trusted network of domain ontologies covering different environmental facets and multiple languages. These ontologies are going to be published in RDF (namely SKOS) and described in the Vocabulary of Interlinked Datasets (VoID) [7]. Some of them have already gone live or will go live in the coming months.

The commitment to linking environmental data to these vocabularies is rather vague: the “SKOS files would also be published as linked data”, but someone else has to publish the observation data and link it to the reference vocabulary.

[5] http://esw.w3.org/HCLSIG/LODD
[6] http://ecoterm.infointl.com
[7] http://vocab.deri.ie/void/guide

Figure 3 shows such plans from Germany. Blue bubbles stand for SKOS vocabularies in the Ecoterm context, white squares for information systems holding observation data.

Figure 3 Linked Environment Data (Vision@de 2009)

The Semantic Network Service (SNS) [8] has been maintained by the UBA since 2003. SNS includes a thesaurus (UMTHES), a gazetteer, and a chronicle, with occasional links among them. All three are currently available in the XML Topic Maps format [9]. A first draft of an RDF vocabulary for SNS was presented in 2006, but to date only the thesaurus has been migrated into a SKOS-XL representation (see section 3 for more details about the RDF models of SNS).

Figure 3 shows links from SNS to several European vocabularies (top left). For years, the European reference vocabulary has been the GEneral Multilingual Environmental Thesaurus (GEMET) [10], maintained by the European Environment Agency (EEA). GEMET was one of the first SKOS use cases in 2004 and is still available in this serialization.
Since last year it has also been published using the Linked Data technical patterns. UMTHES is already linked with GEMET, so we do not need any direct linkage between ESB and GEMET. GEMET is much smaller than UMTHES (which has been one of its sources), but it is available in 29 languages.

[8] http://www.semantic-network.de
[9] http://isotopicmaps.org
[10] http://www.eionet.europa.eu/gemet

The second vocabulary from the EEA is the EUNIS biodiversity database [11], with a focus on species. EUNIS was published in RDF early this year, using several properties from the Darwin Core vocabulary [12].

The third example is the Environmental Applications Reference Thesaurus (EARTh) [13] from Italy, which has been published in SKOS and linked with EUNIS as well.

The bottom of Figure 3 shows some exemplary observation data which we plan to publish as linked data in this context. The German ESB is the starting point in this case, and we will try to motivate international partners (I-ESB) to join this Linked Data cloud.

The ESB reports the accumulation of pollutants/substances in defined samples at specific places with respect to time, but it is not itself the specialist that can exhaustively describe these reference elements. Hence, the data has to be linked to specialized information about each of these parameters. For substances, for example, the links could point to the corresponding substance information in the GSBL; for species (as test subjects), to EUNIS [14]; for places, to the Geo Thesaurus of the SNS; and for time references, to the Environment Chronicle (SNS). The Environmental Thesaurus (UMTHES) forms a layer on top of it, which in turn is linked to the international GEMET. Each data record of the ESB can be directly linked to the pieces of information of these specialized services. Ideally, the specialist information links back to the data records, thereby enabling bi-directional navigation.
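
The linking of one ESB record to the specialist services, and the back-links that make navigation bi-directional, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every URI below lives under example.org and is hypothetical, as are the record identifiers, the property names (substance, species, place, event), and the helper functions. None of them is taken from the actual ESB, GSBL, EUNIS, or SNS models.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement; all three parts are URIs in this sketch."""
    subject: str
    predicate: str
    obj: str


# Hypothetical namespaces, for illustration only (not the services' real URIs).
ESB = "http://example.org/esb/"
GSBL = "http://example.org/gsbl/substance/"
EUNIS = "http://example.org/eunis/species/"
SNS_GEO = "http://example.org/sns/gazetteer/"
SNS_CHRONICLE = "http://example.org/sns/chronicle/"


def describe_record(record_id, substance, species, place, event):
    """Link one ESB record to the specialist description of each parameter."""
    record = ESB + "record/" + record_id
    return [
        Triple(record, ESB + "substance", GSBL + substance),
        Triple(record, ESB + "species", EUNIS + species),
        Triple(record, ESB + "place", SNS_GEO + place),
        Triple(record, ESB + "event", SNS_CHRONICLE + event),
    ]


def back_links(triples):
    """Index triples by object, so a specialist resource can list the records citing it."""
    index = defaultdict(list)
    for t in triples:
        index[t.obj].append(t.subject)
    return dict(index)


# Two made-up records: HCH isomers measured in bream from the Mulde and the Elbe.
triples = (
    describe_record("r1", "alpha-HCH", "bream", "mulde", "elbe-flood-2002")
    + describe_record("r2", "beta-HCH", "bream", "elbe", "elbe-flood-2002")
)
links = back_links(triples)

# Navigate backwards: which records refer to the chronicle entry for the flood?
for record in links[SNS_CHRONICLE + "elbe-flood-2002"]:
    print(record)
```

In a real deployment the back-links would be published by the specialist service itself (or discovered by querying), rather than computed from a local list; the index here only shows what information such a back-link carries.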

In addition to the information systems mentioned above, there are numerous specialist sources that are not provided by authorities, e.g. Chemical Entities of Biological Interest (ChEBI) [15] or GeoNames [16]. Whether these should be linked as well is a political question; the technological prerequisites are fulfilled.

3. RDF Representation

To put the linking mechanism to work and to be able to access a given reference directly, an RDF data representation needs to be created for all participating systems. It is based on HTTP URIs (Web addresses) and a generic data model that has triples (subject/predicate/object) as its sole constituent. Subject and predicate are always encoded as HTTP URIs; the object can be a URI as well, or a literal (e.g. a number or a character string). For examples, please refer to the participants’ models in the following sections.

This approach forms the basis for describing and applying individual models (RDF Schema or “vocabulary”) that are broadly comparable to object-relational models but can be semantically richer. Numerous RDF vocabularies have already been established. These vocabularies can and should be used, combined, and extended whenever possible and needed.

In the following we use an ESB data example (Figure 4) with a striking peak in 2004.

[11] http://eunis.eea.europa.eu/
[12] http://rs.tdwg.org/dwc/
[13] http://uta.iia.cnr.it/earth_eng.htm
[14] http://eunis.eea.europa.eu/species.jsp
[15] http://www.ebi.ac.uk/chebi/
[16] http://www.geonames.org/

Figure 4 ESB data example

It is assumed that during earthworks and dike reconstructions following the Elbe flood of 2002, high quantities of alpha-HCH and beta-HCH were released from soils and dumping grounds around Bitterfeld and entered the river Elbe via the river Mulde.
The increasing contamination of the Mulde is reflected in elevated levels of alpha-HCH and beta-HCH in bream since 2003, with peak concentrations in 2004 and 2005, respectively, in both the river Mulde and the river Elbe. The RDF examples in the following subsections will demonstrate how this peak is annotated by links between ESB and SNS.

3.1 RDF Model for the Environment Speci