Abstract

The landscape of spatial data infrastructures (SDIs) is changing. In addition to traditional authoritative and reliably sourced geospatial data, SDIs increasingly need to incorporate data from non-traditional sources, such as local sensor networks and crowd-sourced message databases. These new data come with variable, loosely defined, and sometimes unknown provenance, semantics, and content. The next generation of SDIs will need the capability to integrate and federate geospatial data that are highly heterogeneous. These data comprise a vast observation space: they could be represented in many forms, will have been generated by a variety of producers using different processes, and will have originally been intended for purposes that may differ markedly from their later use. There are several discriminative dimensions along which we can describe the properties of the data found in SDIs, such as the data structure, the spatial framework (e.g., field, image, or object-based), the semantics of the attributes, the author or producer, the licensing, etc. These dimensions define a universe of model possibilities for data in an SDI, known as a model space. A core research challenge remains to recognise and, to the degree possible, resolve a comprehensive set of model dimensions that will enable us to characterise the many possible models by which geospatial data can be represented. A second challenge is to describe the transformations within and between models, and the ways in which these transformations change aspects of the underlying model. Despite recent movement toward semantically described services for SDIs, the scope and range of descriptive dimensions for geospatial data are underspecified. In this paper we present a diverse set of important dimensions that pose a series of challenges for data integration, describe how both traditional and emergent datasets can be characterised within these dimensions, and point to some interesting differences.

1 Introduction: The evolving role of the SDI

National and regional spatial data infrastructures (SDIs) were originally conceived as centralised geospatial data repositories containing data that came largely from authoritative sources (Masser, 1999; Groot and McLaughlin, 2000; Jacoby et al., 2002). The advent of local sensor webs, web 2.0 and so-called volunteered geographic information (VGI), i.e., voluminous geospatial data that are made available from a multitude of sources of varying quality and that are often uncontrolled in terms of their creation process, representation and content, has changed the landscape (Goodchild, 2007; Budhathoki et al., 2008). While not produced by authoritative agencies, these data can represent better coverage of specific geospatial phenomena, or be more timely, due to their distributed and unconstrained methods of generation (Coleman et al., 2009). Thus in many cases they are, in fact, of higher value, e.g., for time-critical tasks in emergency response. However, their domain content (or the tasks to which they can be usefully applied) may not be known in advance and may require post-processing to extract.
What, then, is the role of the spatial data infrastructure in such an environment, given that many of the data that analysts and policy makers will find useful may come from such widely varying and incompatible sources? And how can its users understand the utility and reliability of the data products derived from mashing up such heterogeneous datasets? It is a challenging problem because, in order to effectively match data to a specific application need, we must consider several aspects of the data at once, including not only the spatial framework of the data but also the provenance, semantics, context of authorship, access rights, etc. (what we might call the pragmatics of the data, to differentiate it from the geospatial semantics of the data) (Pike and Gahegan, 2007; Gahegan et al., 2009). Recent work in merging VGI with SDI has advocated for better semantic representation, using formal languages from the semantic web, and while this is a good step we argue that a more holistic approach is necessary (Janowicz et al., 2010). This is not a new research problem, but to date the relevant research in GIScience has focused on piecemeal solutions to specific strands of the problem, tackled in isolation, that do not work together in the orchestrated way that would be necessary to build a more advanced SDI.

In the following section we introduce our vision for a next-generation SDI. In Section 3 we present a diverse set of important dimensions for next-generation SDI. We follow with examples of how data transformations can be represented in terms of those dimensions, showing both authoritative and other, less formal, data. Finally, we conclude with a summary of why we think this is an important time to reconsider the role of the SDI as a vital component in a connected approach to the science process: i.e., linked science or eScience (Hey and Trefethen, 2005; Mäs et al., 2011).

2 Next generation spatial data infrastructure

In a typical GIS problem-solving workflow, we encounter distinct steps such as the following:

1. Locate, gain access to and, to some extent, understand the limitations of each dataset we intend to use. Currently, SDIs, and specifically their data catalogue and search tools, can sometimes help here.
2. Transform the datasets we will use into a consistent form (model), for example by re-projecting, converting from raster to vector or harmonising the semantics. The decisions we make here can have profound implications for the quality of the data.
3. Combine the datasets via an analytical workflow of some kind.
4. Assess the accuracy and reliability of the result and (possibly) publish it back into the SDI.

The geospatial datasets that we might wish to combine could be highly heterogeneous. They will be represented in many forms, will have been generated by a variety of producers using different processes and may have originally been intended for purposes that are different from their present use. These ideas, and others, form the dimensions along which we can describe the properties of a dataset found in an SDI, such as the data structure, the spatial framework (e.g., field, image, or object-based), the semantics of the attributes, the author or producer, etc. These dimensions define a universe of model possibilities, or model space, for datasets that the SDI interacts with.
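To make the notion of a model space a little more concrete, the following minimal Python sketch records a dataset's position along a handful of the dimensions mentioned above, contrasting an authoritative satellite product with a crowd-sourced feed. All class, field and value names here are our own illustrative choices and are not defined by the paper or by any existing SDI standard.

from dataclasses import dataclass
from typing import Optional

# A hypothetical, non-exhaustive set of model-space dimensions for one dataset.
@dataclass
class ModelSpacePosition:
    spatial_framework: str          # e.g., "field", "image", "object"
    data_structure: str             # e.g., "raster", "vector"
    attribute_semantics: str        # pointer to a vocabulary or ontology term
    producer: str                   # author or producing agency
    provenance: str                 # lineage description (e.g., PROV-O style)
    licence: Optional[str] = None   # access and reuse conditions, if known

# An authoritative remote-sensing product...
landsat_scene = ModelSpacePosition(
    spatial_framework="image",
    data_structure="raster",
    attribute_semantics="surface reflectance",
    producer="USGS/NASA",
    provenance="Landsat 7 ETM+ Level-1 processing chain",
    licence="public domain",
)

# ...versus a crowd-sourced message stream.
tweet_stream = ModelSpacePosition(
    spatial_framework="object",
    data_structure="geotagged point features",
    attribute_semantics="unstructured text and hashtags",
    producer="many uncoordinated individuals",
    provenance="largely unknown",
    licence="platform terms of service",
)

In practice these dimensions would be drawn from controlled vocabularies or formal ontologies rather than free text, but even a crude record such as this makes the heterogeneity between the two sources explicit and comparable.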
In order to capitalise on the value of such a wide variety of data, we need to detail the many ways in which we might integrate or transform data that reside at different points in model space. A core research challenge, therefore, is to recognise and, to the degree possible, resolve a comprehensive set of characteristic model dimensions¹ that enable us to characterise transformations within and between data models. These dimensions provide us with a conceptual framework to understand the ways that data are transformed and made fit-for-purpose.

A data source, such as the Landsat 7 sensor or a crime logging system, has the potential to create a series of datasets, so a source can be represented in this model space similarly to an individual dataset. But rather than being represented as one point in model space, a data source may be represented as ranges along certain dimensions, describing the potential values that a specific dataset may inherit. For example, each Landsat 7 dataset will have a unique timestamp and a spatial footprint drawn from a set of possibilities defined by the orbital characteristics. But the spatial framework will always be an image and the data will always be packaged into a raster data structure.²

¹ The term 'dimension' is used loosely here in a cognitive sense and does not imply an ordering of values, as in the mathematical sense of the word.
² Interestingly, the error characteristics of the datasets change over time as the sensor picks up damage, so this too is a range rather than a point.

Moving a dataset from one point in model space to another will incur a series of costs related to the work done, changes in accuracy or resolution, changes in semantics, etc. We routinely do move data in model space, but we typically do not account for all the changes that ensue. We aim to address this shortcoming by representing the model space and describing (as richly as we can) what happens to datasets that are transformed from one point in this space to another. We can assign a cost function to each dimension (i.e., a distance metric) that allows us to account for the cost of transforming data from one model to another. A data transformation is then represented as a function that takes one or more datasets and their associated models and returns a tuple consisting of a new dataset and its location within the model space. Finally, each model in the space has its own sets of behaviours, translators (to other models), constraints, and supported data structures.

An important difference between the traditional SDI and the next-generation SDI is that, because of the heterogeneity of producers, the next-generation SDI will be distributed and federated rather than following the traditional model of a centralised repository. It will thus need to incorporate data from disparate sources that are not controlled from within the SDI, in line with the paradigm of linked data (Bizer et al., 2009; Schade et al., 2010). Perhaps more importantly, the idea of the SDI as simply an ingester of data will change. Geospatial data sources such as sensor networks and social media feeds are increasingly real-time and configurable. For example, a sensor network may be able to sample some phenomenon every day, hour or minute and may be able to
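As a rough illustration of the cost accounting described above (a sketch under our own assumptions, not a proposed SDI API), the following Python fragment treats a transformation as a function that returns a new dataset together with its new position in model space and a per-dimension cost computed by simple, hypothetical distance functions.

from dataclasses import dataclass, replace
from typing import Any, Callable, Dict, Tuple

# Illustrative only: a per-dimension "distance" expresses how much a
# transformation changes that aspect of the model.
CostFunction = Callable[[Any, Any], float]

@dataclass(frozen=True)
class Model:
    spatial_framework: str        # e.g., "image", "field", "object"
    data_structure: str           # e.g., "raster", "vector"
    spatial_resolution_m: float   # nominal ground resolution in metres

# Hypothetical cost functions, one per dimension we choose to track.
COSTS: Dict[str, CostFunction] = {
    "spatial_framework": lambda a, b: 0.0 if a == b else 1.0,      # nominal: changed or not
    "data_structure": lambda a, b: 0.0 if a == b else 1.0,
    "spatial_resolution_m": lambda a, b: abs(b - a) / max(a, b),   # relative change
}

def rasterise(dataset: Any, model: Model, cell_size_m: float) -> Tuple[Any, Model, Dict[str, float]]:
    """Toy transformation: move a vector dataset into a raster model.

    Returns the (placeholder) new dataset, its new location in model space,
    and the cost incurred along each tracked dimension.
    """
    new_model = replace(model, data_structure="raster", spatial_resolution_m=cell_size_m)
    costs = {dim: fn(getattr(model, dim), getattr(new_model, dim)) for dim, fn in COSTS.items()}
    new_dataset = dataset  # a real implementation would resample/convert here
    return new_dataset, new_model, costs

# Example usage: converting a 1 m vector cadastre to a 10 m raster records
# both the structural change and the loss of resolution.
_, new_model, costs = rasterise("cadastre.shp", Model("object", "vector", 1.0), cell_size_m=10.0)
print(new_model, costs)

The point of such an accounting is that every move in model space, even a routine rasterisation, leaves an explicit record of what was changed and at what cost, which the SDI can attach to the derived dataset.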
References

[1] Ian Williamson, et al. Developing a common spatial data infrastructure between State and Local Government - an Australian case study. Int. J. Geogr. Inf. Sci., 2002.
[2] P. Georgiadou, et al. Spatial data infrastructure (SDI) and e-governance: a quest for appropriate evaluation approaches. 2006.
[3] H. Onsrud, et al. Protecting personal privacy in using geographic information systems. 1994.
[4] Stephan Mäs, et al. Linking the Outcomes of Scientific Research: Requirements from the Perspective of Geosciences. LISC, 2011.
[5] M. Bishr, et al. Geospatial Information Bottom-Up: A Matter of Trust and Semantics. AGILE Conf., 2007.
[6] Krzysztof Janowicz, et al. Constructing geo-ontologies by reification of observation data. GIS, 2011.
[7] Andrew U. Frank, et al. Procedure to Select the Best Dataset for a Task. GIScience, 2004.
[8] M. Goodchild. Citizens as sensors: the world of volunteered geography. 2007.
[9] Max J. Egenhofer, et al. Toward the semantic geospatial web. GIS '02, 2002.
[10] Robert Jeansoulin, et al. Towards spatial data quality information analysis tools for experts assessing the fitness for use of spatial data. Int. J. Geogr. Inf. Sci., 2007.
[11] Stephan Winter, et al. Talking About Place Where it Matters. 2013.
[12] Tom Heath, et al. Open Data Commons, a License for Open Data. LDOW, 2008.
[13] Anne E. Trefethen, et al. Cyberinfrastructure for e-Science. Science, 2005.
[14] Mark Gahegan, et al. Connecting GEON: Making sense of the myriad resources, researchers and concepts that comprise a geoscience cyberinfrastructure. Comput. Geosci., 2009.
[15] Deborah L. McGuinness, et al. PROV-O: The PROV Ontology. 2013.
[16] Krzysztof Janowicz, et al. A Geo-semantics Flyby. Reasoning Web, 2013.
[17] Mark Gahegan, et al. A Situated Knowledge Representation of Geographical Information. Trans. GIS, 2006.
[18] Richard Groot, et al. Geospatial Data Infrastructure: Concepts, Cases, and Good Practice. 2000.
[19] Ian Masser, et al. All shapes and sizes: the first generation of national spatial data infrastructures. Int. J. Geogr. Inf. Sci., 1999.
[20] Mark Gahegan, et al. Specifying the transformations within and between geographic data models. Trans. GIS, 1996.
[21] Mark Dredze, et al. You Are What You Tweet: Analyzing Twitter for Public Health. ICWSM, 2011.
[22] Werner Kuhn, et al. Trust and Reputation Models for Quality Assessment of Human Sensor Observations. COSIT, 2013.
[23] Mark Gahegan, et al. The Effects of Licensing on Open Data: Computing a Measure of Health for Our Scholarly Record. International Semantic Web Conference, 2013.
[24] Michael F. Worboys, et al. GIS: a computing perspective. 2004.
[25] Mark Gahegan, et al. Experiments to Examine the Situated Nature of Geoscientific Concepts. Spatial Cogn. Comput., 2007.
[26] David Coleman, et al. Volunteered Geographic Information: the nature and motivation of produsers. Int. J. Spatial Data Infrastructures Res., 2009.
[27] Yogesh L. Simmhan, et al. A survey of data provenance in e-science. SGMD, 2005.
[28] Ian Horrocks, et al. Reasoning Web. Semantic Technologies for Intelligent Data Access. Lecture Notes in Computer Science, 2013.
[29] Sven Schade, et al. Augmenting SDI with Linked Data. 2010.
[30] Miriam J. Metzger, et al. The credibility of volunteered geographic information. 2008.
[31] Edward A. Lee, et al. Scientific workflow management and the Kepler system. Concurr. Comput. Pract. Exp., 2006.
[32] Mark Gahegan, et al. Beyond ontologies: Toward situated representations of scientific knowledge. Int. J. Hum. Comput. Stud., 2007.
[33] Bertram C. Bruce, et al. Reconceptualizing the role of the user of spatial data infrastructure. 2008.
[34] James Frew, et al. Lineage retrieval for scientific data processing: a survey. CSUR, 2005.
[35] Christoph Stasch, et al. Semantic Enablement for Spatial Data Infrastructures. Trans. GIS, 2010.
[36] Tim Berners-Lee, et al. Linked Data - The Story So Far. Int. J. Semantic Web Inf. Syst., 2009.