Preserving Geospatial Data: The National Geospatial Digital Archive’s Approach

The National Geospatial Digital Archive (NGDA) is one of eight initial projects funded by the Library of Congress’s National Digital Information Infrastructure and Preservation Program (NDIIPP). The project’s overarching goal is to answer the question: How can we preserve geospatial data on a national scale and make it available to future generations? This paper summarizes the project’s work in four areas: analysis of the characteristics of geospatial data relevant to preservation; elucidation of the “relay” principles of long-term preservation; development of an OAIS-compliant archive system; and development of a wiki- and repository-based format registry.

Introduction

The National Geospatial Digital Archive (NGDA), a partnership between the Map & Imagery Laboratory, Davidson Library, at the University of California at Santa Barbara, and Branner Earth Sciences Library at Stanford University, is one of eight initial projects funded by the Library of Congress’s National Digital Information Infrastructure and Preservation Program (NDIIPP). The project’s overarching goal is to answer the question: How can we preserve geospatial data on a national scale and make it available to future generations? Work on the project began in earnest in 2005 and immediately led to several new questions being posed:

• What are the characteristics of geospatial data that impact preservation?
• Given a desire to preserve information for a century or longer (a period of time far exceeding the lifetimes of the applications, platforms, and people involved in the information’s creation), is there any preservation architecture, or are there at least any general design principles or best practices, that can carry the information through a century of unforeseeable technological and social change?
• Given a desire to preserve information on a large scale, can we define a minimal level or minimum standard of preservation that has a high chance of being achieved over the course of a century, without interruption or discontinuity, so that the information remains (at least potentially) as useful as when it was created, despite unforeseeable fluctuations in the resources that can be devoted to the information’s curation over time, and fluctuations in interest in the information and in the information’s perceived value?

This paper summarizes NGDA’s work in answering these questions. In the next section we list characteristics of geospatial data relevant to preservation. In the subsequent two sections, we elucidate three principles of long-term preservation and describe a prototype archive system built by NGDA that satisfies those principles. Finally, we describe NGDA’s work in developing a wiki- and repository-based format registry.

1 http://www.ngda.org/
2 http://www.digitalpreservation.gov/

Geospatial data characteristics

Geospatial data refers to the wide variety of scientific and government-produced datasets that have a geographic component, and that can typically be viewed as representing a portion of the Earth’s surface in some way. This class of information encompasses remote-sensing imagery, aerial photography, maps, data produced by both fixed and mobile geographically embedded sensors, and data created and processed by GIS (Geographic Information System) tools. The following are some characteristics of geospatial data that are relevant to its preservation.

No uniform data model. Geospatial data spans a wide variety of data organizations: vector and raster; topological and non-topological; over domains both discrete and continuous. Geospatial applications and file formats support differing subsets and aspects of these data organizations, and to varying degrees.
One attempt has been made at defining a universal, public data model for geospatial data, the USGS SDTS format, but it has failed to achieve widespread adoption. As a consequence, it is not possible to speak of “geospatial data” as a single type of quantity that can be handled by multiple, functionally equivalent applications and formats.

3 http://mcmcweb.er.usgs.gov/sdts/

Proprietary formats. Many geospatial formats, particularly GIS formats, are proprietary and therefore closely tied to applications. Furthermore, as is typical with formats driven by marketplace competition, they are frequently subject to backward-incompatible revisions over time.

Multiple granule sizes. In contrast to textual information, which has been successfully modeled using multi-page, (hyper)textual documents as the sole granule size, geospatial data is regularly processed at varying granule sizes. The granule sizes range from individual features having geographic location, geometry, and related attributes; to homogeneous, thematic layers of features; to integrated, heterogeneous databases. Data can be aggregated, disaggregated, and operated on with some fluidity. Each of these granularities has its uses, affords different functionality, and poses different preservation challenges. As a consequence, there is no single preservation problem for geospatial data; instead, choosing which level or granule size to address, and therefore identifying the preservation problem(s), is a first step of the process.

Relational data systems. Geospatial data managed by GIS tools is more and more often being stored in “geodatabases”: relational databases with geographic extensions. The virtue of the geodatabase (that it provides a unified, seamless environment in which to store complex relationships among heterogeneous features) is also a bane for preservation: it is often not possible to extract individual components out of the database without losing information.
Geodatabases also inherit all the problems of preserving relational databases: the need to take snapshots of running database systems; storage of snapshots in proprietary database dump formats; complex dump formats; and large, monolithic snapshot files.

Large size. Geospatial data is large by any measure, with datasets commonly having gigabyte granularities and with some datasets growing by terabytes per day.

Long-lived programs. Geospatial datasets can be long-lived: satellite-based sensor programs may run for years, even decades. As a consequence, it becomes necessary to begin archiving datasets long before they are “finished.” Traditionally this has been addressed by binding datasets to storage systems that inevitably become obsolete even within the program’s lifetime, but archival systems of the future that hope to lower both the cost of preservation and the risk of information loss will need to be designed to allow easy turnover and handoff of ever-evolving components and technologies.

Extensive context. Capturing and preserving enough of the context surrounding geospatial data to support the data’s future interpretation and use can be challenging. Whereas format information by itself is sufficient to support the future renderability of multimedia documents, and therefore their usability by humans (e.g., knowledge of the PDF format is sufficient to render PDF documents), geospatial data can require much more, and more complex, contextual information. Using remote-sensing imagery in scientific modeling requires detailed knowledge of platform and sensor characteristics, and in many cases calibration and processing steps as well. Strictly speaking, such contextual information constitutes metadata, but in practice, being voluminous, it is not handled as such (for example, it is not stored in metadata records bundled with the data).

Implicit context.
In many cases, the context surrounding geospatial data is implicit and embedded in small, relatively insular scientific communities.

Dynamic data. Some datasets, particularly Climate Data Records (CDRs), may need to be periodically reprocessed from source datasets in response to corrections and improvements in calibration and Earth models. Thus the context for these datasets must include not only information for their use, but information for their (re)processing as well, including software, algorithms, workflows, ancillary calibration tables, and other artifacts. Beyond simply storing such information, it must be possible to re-execute workflows, which implies that lineage relationships between datasets and source datasets must be actively maintained. In the larger view, science datasets reside in a dynamic ecosystem of related datasets, and to preserve a dataset means to preserve the dataset’s ability to function in that ecosystem.

From these characteristics we conclude that preserving geospatial data poses several challenges beyond those already imposed by the general digital preservation problem. Whereas a multimedia document typically resides within a single file, geospatial data may reside in complex, multi-file objects. Whereas the interpretation of a PDF document may be defined by the format label “PDF,” and in turn by an entry in a central format registry, geospatial data may require extensive, product-specific context to interpret. Whereas a thesis or journal article is fixed upon publication, geospatial data can remain dynamic indefinitely, owing to the lifetime of the generating program and the need for periodic reprocessing.

Relay-supporting preservation architectures

We now turn to NGDA’s work on preservation architectures.
In thinking about how information can be preserved, it is natural to focus on the system that will house the information: a system must be built to hold the information and make it accessible; the system’s purpose is (at least in part, if not wholly) to preserve the information; and hence, it is tempting to think, by building the system, the preservation of the information will have been addressed. This line of thinking is particularly attractive if the system supports preservation-related functionality such as format migration. But if our goal is to preserve the information for a century or longer, it is evident that any system, no matter how well-designed or well