A Content Integrity Service For Long-Term Digital Archives

We present a content integrity service for long-lived digital documents, especially for objects stored in long-term digital archives. The goal of the service is to demonstrate that information in the archive is authentic and has not been unintentionally or maliciously altered, even after its bit representation in the archive has undergone one or more transformations. We describe our design for an efficient, secure service that achieves this, and our implementation of the first prototype of such a service, built for HP’s Digital Media Platform. Our solution relies on one-way hashing and digital time-stamping procedures. Our service applies not only to transformations of archival content, such as format changes, but also to the introduction of new cryptographic primitives, such as new one-way hash functions. This feature is essential in the design of an integrity-preserving system that is meant to endure for decades.

Introduction

Information in a digital archive can include complex multipart documents. In a long-term archive these documents may be expected to undergo multiple transformations during their lifetime, including, for example, format changes and modifications to sub-parts and to accompanying metadata. Skeptical users of a digital archive may desire, or in some cases may be legally required, to verify the integrity of records that they have retrieved from the archive.

All typical algorithmic techniques for verifying the integrity of a digital object begin with a representation of the object in question as a sequence of bits. When digital objects are transformed in any nontrivial way, their bit representations change as well, so these algorithmic techniques no longer apply to the transformed object. In fact, it is the usual aim of a cryptographic technique for proving integrity that it “fail” (more precisely, that it correctly succeed in proving lack of integrity) when even a single bit in the object’s representation is changed.
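This single-bit sensitivity can be demonstrated with a few lines of code. The sketch below uses SHA-256 purely for illustration; the paper does not commit to any particular hash function, and the document contents are invented for the example.

```python
import hashlib

# A document's hash value acts as its "fingerprint": recomputing and
# comparing hashes detects any bit-level alteration.
document = b"Archived record, version 1"
fingerprint = hashlib.sha256(document).hexdigest()

# Changing even one byte of the representation yields a completely
# different hash value, so the integrity check correctly "fails".
tampered = b"Archived record, version 2"
assert hashlib.sha256(tampered).hexdigest() != fingerprint
```

Note that this check only establishes bit-by-bit equality with the original representation; it says nothing about a legitimately transformed copy, which is exactly the gap the service addresses.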
In this work we describe an efficient and secure Content Integrity Service (CIS) that solves this problem, which we designed and implemented as a service on the Digital Media Platform (DMP) [1].

Background

The basic building blocks of our solution are cryptographic hash functions and time-stamping procedures. Throughout this article we refer to the objects of concern in a digital archive or repository simply as “documents”.

Hash functions

A cryptographic hash function is a fast procedure H that compresses input bit-strings of arbitrary length to output bit-strings (called hash values) of a fixed length, in such a way that it is computationally infeasible to find two different inputs that produce the same hash value. (Such a pair of inputs is called a collision for H.) For any digital document x, its hash value v = H(x) can be used as a proxy for x, as if it were a characteristic “fingerprint” for x, in procedures for guaranteeing the bit-by-bit integrity of x [2].

Time-stamping

A digital time-stamping scheme is a procedure that solves the following problem: given a digital document x at a specific time t, produce a time-stamp certificate c that anyone can later use (along with x itself) to verify that x existed at time t. Certificates that pass the verification test should be difficult to forge [3].

There are two families of time-stamping algorithms: those using digital signatures (hash-and-sign) and those based entirely on cryptographic hash functions (hash-linking). In a hash-and-sign time-stamping scheme, the time-stamp certificate for a document consists of a digital signature computed by a Time-Stamping Authority (TSA) over the document and the time of signing. In practice the TSA usually signs the hash of the document rather than the document itself; hence the name.
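The shape of a hash-and-sign certificate can be sketched as follows. To keep the example self-contained, an HMAC with a TSA-held secret key stands in for the TSA's digital signature; a real TSA would use a public-key signature algorithm, and all names here (the key, the certificate fields) are invented for illustration.

```python
import hashlib
import hmac
import time

TSA_KEY = b"tsa-private-key"  # hypothetical TSA signing key (stand-in for a real private key)

def timestamp(document: bytes) -> dict:
    """Issue a hash-and-sign style certificate for a document."""
    digest = hashlib.sha256(document).hexdigest()  # the TSA signs the hash, not the document
    t = int(time.time())
    sig = hmac.new(TSA_KEY, f"{digest}|{t}".encode(), hashlib.sha256).hexdigest()
    return {"hash": digest, "time": t, "sig": sig}

def verify(document: bytes, cert: dict) -> bool:
    """Check that the certificate matches this document and was signed by the TSA."""
    digest = hashlib.sha256(document).hexdigest()
    expected = hmac.new(TSA_KEY, f"{digest}|{cert['time']}".encode(), hashlib.sha256).hexdigest()
    return digest == cert["hash"] and hmac.compare_digest(expected, cert["sig"])

cert = timestamp(b"archived record")
assert verify(b"archived record", cert)
assert not verify(b"altered record", cert)
```

The sketch also makes the central weakness visible: anyone who ever obtains TSA_KEY can mint a valid-looking certificate for any document and any claimed time.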
Hash-and-sign has two major drawbacks as a tool for long-term archives:

(1) It requires the assured existence of trustworthy archived key-validity history data, in order to check the validity of the TSA’s public key. It is a problem for any TSA to manage such a key-validity database over extended periods of time, let alone to integrate it with currently deployed commercial PKIs (public-key infrastructures). (See [4] for a proposed solution.)

(2) The trustworthiness of the certificate depends entirely on an assurance that the TSA’s private signing key has never been compromised. This is an unacceptable premise for long-term archives. The combination of increasing computational resources and advances in cryptanalytic techniques can be expected to render current digital-signature algorithms ineffective and susceptible to attack. More simply, the private key of a TSA may leak or be stolen. Either way, an adversary would then be able to produce certificates for any document, with an arbitrary claimed time, past or future.

For the CIS, we chose a time-stamping technique called hash-linking. In this technique, the hash value of the document to be time-stamped is combined with other hash values received during the same time period to create a witness hash value. This witness value is then stored by the TSA or published as a widely witnessed event. This kind of linking makes it computationally infeasible for an adversary to back-date a document, since doing so would entail computing hash collisions for the witness values (or their hash preimages). The technique relies only on the collision-resistance properties of hash functions, and involves no secrets or keys that need to be securely protected over extended periods of time [5, 6, 7]. In one implementation of hash-linking, the witness hash values themselves are linked in a hash chain, and the hash values within each time period are combined using a Merkle hash tree [8].
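The Merkle-tree step can be sketched as follows. The sketch assumes SHA-256 and four requests per interval, mirroring the four-request interval of Figure 1; the function names and audit-path encoding are invented for the example.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_witness(leaves):
    """Combine the hash values received in one interval into a single
    witness hash value by pairwise hashing up a Merkle tree."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:  # duplicate the last node if the level is odd
            level.append(level[-1])
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Four requests y1..y4 arrive during the interval.
y = [H(f"document-{i}".encode()) for i in range(1, 5)]
witness = merkle_witness(y)

# A time-stamp certificate for y1 is its audit path: the sibling hashes
# (with their left/right positions) needed to recompute the witness value.
def verify_path(leaf, path, witness):
    h = leaf
    for sibling, leaf_is_left in path:
        h = H(h + sibling) if leaf_is_left else H(sibling + h)
    return h == witness

path_for_y1 = [(y[1], True), (H(y[2] + y[3]), True)]
assert verify_path(y[0], path_for_y1, witness)
```

Note that verification uses only public hash computations: there is no key whose compromise could invalidate old certificates, which is what makes this approach suitable for archives meant to last decades.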
For example, Figure 1 illustrates this process for an interval during which the requests y1, ..., y4 were received. In this diagram, H12