Preserving Records in the Cloud: A Model to Enhance Metadata Interoperability in a Cloud Environment

An increasing number of organisations are using cloud computing to create and store digital records. The problems of preserving electronic documents in general are well known, and steps can be taken to ensure that systems provide long-term accessibility and readability of electronic records. With cloud computing, however, the management of and responsibility for infrastructure, systems and data may no longer reside in the organisation in which the electronic records are created. For this reason, many producers of digital content choose to transfer it to a dedicated archive. Since such a transfer can be both costly and time-consuming, this raises the question of how the process can be simplified and what can be done to increase interoperability between producer and archive when one or both of these are in the cloud.
This thesis examines cloud computing from an archiving perspective and how this new technology fits with existing models of digital archiving, exemplified by the Open Archival Information System (OAIS) reference model. In OAIS, digital archives receive their records and accompanying metadata from a producing organisation or institution in a predetermined package format. The archive and producer agree on a set of requirements for submission, defined by the archive in order to ensure an easy ingest. Once the contents have been ingested, they are stored and managed in a data centre, where the archive has complete control of the technological infrastructure and digital objects. Using this infrastructure and digital archiving software, the archive provides access to archive users while ensuring the ongoing preservation of the archiving collection. Based on the above reference model, the thesis identifies four areas where the OAIS model does not address the requirements of a cloud environment.
1) The functional entities in OAIS are interdependent, which makes it difficult to transfer responsibility for parts of an OAIS archive to an external service provider. For example, if an organisation is looking for a storage solution offering bit-level integrity for digital objects to use as a back-end for an archiving system, this would involve overlapping functionality from the Management, Data Archiving and Archival Storage entities.
2) In OAIS, the burden of creating Submission Information Packages (SIPs) is left to producers, who must meet the requirements of the OAIS archive. Many of these requirements are related to metadata. An archive will specify a number of mandatory metadata elements that must be included in SIPs and must comply with the formatting and schema rules for submission packages. For a producer, complying with this can be very resource intensive, depending on the strictness of the requirements. This can lead to producers holding on to records for long periods before submitting them in bulk, which can significantly delay preservation planning.
3) With cloud computing, there is less need to include digital objects in Information Packages. With a shared and trusted platform, producers only need to provide the information (a URI or similar) on where the digital objects are stored. However, the OAIS model does not specify the requirements and functionality of such a shared platform.
4) The OAIS model does not cover the initial stages of the document lifecycle (i.e., the Create, Use and Manage stages). It can be argued that these stages lie outside the scope of an archive. However, the nature of the events in these stages and how well they are documented can have a huge impact on how easy it will be to carry out preservation work later on.
Based on the findings of the examination, a model for a cloud archiving system that improves interoperability between producer and archive is proposed, using concepts and information types from OAIS.
The information that comprises an OAIS Information Package can be arranged according to complexity. There is an increase in complexity from the simple digital object to the comprehensive Information Package. This progression from simple to complex is comparable to how information flows in a layered model, where information in one layer is used, manipulated and passed to a higher layer. This model reflects the development in complexity of digital objects and is similar to the document lifecycle, where a document goes through a number of stages over time. The proposed model allows the sharing of functionality and digital objects by making these available as services to the layers above. The model covers the entire document lifecycle, making archive functionality such as preservation planning possible at an early stage in the lifecycle and helping to simplify records transfer. The model is explained in a theoretical case study, using the records transfer process from Japanese government agencies to the National Archives of Japan as an example.
Whereas the proposed layered model serves as a basic conceptual model, it does not solve the problem of how SIPs should be structured. To describe the metadata that should be included in SIPs and where in the layered model it originates, the thesis proposes a metadata application profile for cloud archives. As interoperability is an essential part of the proposed cloud solution (referring here not only to interoperability between producer and archive but also to potential interoperability between different digital archives), the author chose to design the application profile using the Singapore Framework for Dublin Core Application Profiles to define the functional requirements, domain model and description set profile that form the basis of the proposed application profile.
In the profile, METS is used as a transmission and package format and is extended with metadata from the PREMIS data dictionary and the Dublin Core Metadata Element Set. METS was chosen for a number of reasons: it allows the inclusion of other metadata schemas, it can express structurally complex objects, and several solutions using METS already exist. PREMIS defines the core preservation metadata (semantic units) needed to support long-term preservation. Using the proposed application profile, an example METS information package is created with predefined criteria. It was found that the application profile can simplify metadata provision for business systems, compared to systems that do not allow pre-registration. Furthermore, there is potential for the automation of metadata provision, further reducing the amount of metadata that must be explicitly provided. A further examination was performed on the metadata that must be provided by producers and that cannot be automated. It was found that many of these elements describe complicated attributes of digital objects, such as structural relations, encryption or rights information. The more complex the digital objects to be preserved, the more metadata must be provided by business systems, increasing the cost for producers.
The proposed model and application profile answer the research questions of how a model can be developed so that it integrates the requirements of both producer and archive when building a cloud-based digital archive, and what such a system would look like. However, one major barrier to actual implementation is the lack of a formal semantic model and common vocabulary expressed in a machine-readable format. The thesis proposes an
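
To illustrate the package structure described in this abstract, the following is a minimal sketch, not the application profile proposed in the thesis. It builds a METS document that wraps two Dublin Core elements and refers to a digital object by URI instead of embedding it, as suggested for a shared and trusted platform. The METS, XLink and Dublin Core namespaces and element names are the standard ones; the function name `build_sip`, the IDs `DMD1` and `FILE1`, and the example URI are hypothetical, and PREMIS metadata is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for METS, XLink and Dublin Core elements.
METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
DC = "http://purl.org/dc/elements/1.1/"

for prefix, uri in (("mets", METS), ("xlink", XLINK), ("dc", DC)):
    ET.register_namespace(prefix, uri)


def q(ns, tag):
    """Return a namespace-qualified tag in ElementTree's {uri}tag form."""
    return f"{{{ns}}}{tag}"


def build_sip(title, identifier, object_uri):
    """Build a minimal METS package (hypothetical helper, for illustration).

    The digital object is referenced by URI rather than embedded.
    """
    root = ET.Element(q(METS, "mets"))

    # Descriptive metadata: Dublin Core wrapped inside METS.
    dmd = ET.SubElement(root, q(METS, "dmdSec"), ID="DMD1")
    wrap = ET.SubElement(dmd, q(METS, "mdWrap"), MDTYPE="DC")
    xml_data = ET.SubElement(wrap, q(METS, "xmlData"))
    ET.SubElement(xml_data, q(DC, "title")).text = title
    ET.SubElement(xml_data, q(DC, "identifier")).text = identifier

    # File section: only the location of the object is recorded.
    file_sec = ET.SubElement(root, q(METS, "fileSec"))
    file_grp = ET.SubElement(file_sec, q(METS, "fileGrp"))
    file_el = ET.SubElement(file_grp, q(METS, "file"), ID="FILE1")
    ET.SubElement(
        file_el,
        q(METS, "FLocat"),
        {"LOCTYPE": "URL", q(XLINK, "href"): object_uri},
    )

    # Structural map: ties the file to its descriptive metadata.
    struct_map = ET.SubElement(root, q(METS, "structMap"))
    div = ET.SubElement(struct_map, q(METS, "div"), DMDID="DMD1")
    ET.SubElement(div, q(METS, "fptr"), FILEID="FILE1")
    return root


if __name__ == "__main__":
    sip = build_sip(
        "Meeting minutes",                                # example title
        "rec-0001",                                       # example identifier
        "https://storage.example.org/objects/rec-0001",   # hypothetical URI
    )
    print(ET.tostring(sip, encoding="unicode"))
```

In a fuller package, PREMIS semantic units would typically be carried in an administrative metadata section of the same METS document; how the thesis's profile arranges these is not stated here.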
