Operating system support for persistent systems: past, present and future

Containers can exist independently of computation: at certain times a container may be active and support multiple concurrent computations, while at other times it may be purely passive and devoid of activity. The locus is the only abstraction over execution. Loci are scheduled by the kernel and execute within separate address spaces. The address space of a locus is composed from views of contiguous regions of various containers called mappings. In the simplest case, the address space of a locus contains a single mapping allowing it to access the contents of a particular container known as its host. Although each locus executes within a separate address space, a single container may host many loci, allowing them to share the code and data stored within it. A locus is not bound to any particular container and may move to a new host container via the invocation mechanism. Thus, containers and loci are completely orthogonal abstractions.

The Grasshopper kernel must be able to control access to abstractions such as containers and loci. Furthermore, a naming mechanism is required to identify a particular container or locus to the kernel when performing system calls. Grasshopper provides the capability abstraction for this purpose [37]. A capability is conceptually a typed reference to an instance of a kernel abstraction combined with a set of access rights. The rights determine the operations that the holder of the capability may perform on the referend. Capabilities may be held by both containers and loci and are stored in a list associated with the holder. Capability lists are maintained within the kernel to prevent forgery.

Mapping is a viewing mechanism used to compose address spaces from regions of other address spaces. In particular, mapping allows a contiguous region from one address space to be viewed from within a region of the same size in another address space. Changes made to data visible in either region are immediately reflected in both regions. Grasshopper provides two forms of mapping: container mapping and locus mapping. Container mapping allows the address space of a container to be composed from the address spaces of other containers. Since the address space of a locus is composed of regions mapped from the address spaces of various containers, any mappings affecting these containers will also be visible to the locus. For this reason, the effect of container mappings is said to be global. In contrast, locus mapping allows loci to install private mappings to container regions within their address spaces. These are typically used to provide access to per-locus data structures such as stacks.

From an external view of the Grasshopper system, it appears as if all data is stored within containers. In reality, however, the actual data is maintained by managers [38]. A manager is a distinguished container holding code and data to support the transparent movement of data between primary and secondary storage. Managers are the only component of Grasshopper in which the distinction between long- and short-lived data is apparent. To support the implementation of various different store architectures, managers have control over the virtual memory system in a similar manner to external pagers in microkernel operating systems such as Mach [7] and Chorus [29].
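To make the role of a manager more concrete, the sketch below shows the general shape of such a user-level pager. It is a minimal illustration only: the names (manager_handle_fault, kernel_supply_page, store_read_page, container_cap) are hypothetical and stand in for the capability-mediated interface that Grasshopper actually provides.

/*
 * Illustrative sketch of a user-level manager acting as an external pager.
 * All type, function and system-call names here are hypothetical; the real
 * Grasshopper interface is mediated by containers, loci and capabilities.
 */
#include <stdint.h>

#define PAGE_SIZE 4096u

typedef uint64_t container_addr_t;    /* an address within a container */

/* Hypothetical kernel entry point available to a manager. */
extern int kernel_supply_page(int container_cap, container_addr_t page_addr,
                              const void *data);

/* Hypothetical interface to this manager's own store representation. */
extern int store_read_page(container_addr_t page_addr, void *buf);

/*
 * Upcall delivered by the kernel when a locus faults on a page of a
 * container for which this manager is responsible.  The manager locates
 * the data in its own store (a log, a shadow-paged file, and so on) and
 * hands it back to the kernel, which installs the translation and
 * resumes the faulting locus.
 */
int manager_handle_fault(int container_cap, container_addr_t fault_addr)
{
    static unsigned char buf[PAGE_SIZE];
    container_addr_t page_addr = fault_addr & ~(container_addr_t)(PAGE_SIZE - 1);

    if (store_read_page(page_addr, buf) != 0)
        return -1;                      /* kernel may then abort the faulting locus */

    return kernel_supply_page(container_cap, page_addr, buf);
}

Because each container can nominate its own manager, each such routine can implement a different store organisation without any change to the kernel.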
A final aspect of the Grasshopper system is its consistency model. The kernel performs lazy causal consistency tracking between loci at a page granularity [39]. This is achieved by the kernel tracking page faults and maintaining inter-locus dependency information. When a locus makes a resilient copy of its state, known as a snapshot, a dependency matrix describing its dependencies on other loci is recorded on non-volatile storage by the kernel. Using this information, either from non-volatile storage at recovery time or from transient data structures during normal operation, the kernel can always determine the latest consistent global state. It is this state that is restored in the event of a failure or shutdown.

Whilst meeting its original design goals, Grasshopper, like the other systems described in this section, suffered from a number of problems. These are examined in more detail in Section 4, which uses Grasshopper as a case study.

4. PROBLEMS WITH OPERATING SYSTEM SUPPORT FOR PERSISTENCE

Being the designers and implementers of Grasshopper, the authors have a much more detailed understanding of its good and bad points than of the other systems described in Section 3. For this reason, we will use Grasshopper as a case study to throw light on the problems of providing direct support for persistence in the operating system. It should be stressed that we do not believe Grasshopper to be any better or worse than any of the other systems described in Section 3.

One of the design goals of Grasshopper was to permit multiple store architectures to coexist on the same machine without changing the operating system. This was motivated by the desire to permit experimentation with different storage architectures. To fulfil this requirement, the concept of a manager was developed to provide user-level control over the storage of persistent data. Since each container can have its own manager, many different store architectures can co-exist within the same system, allowing each application to manage its persistent data in the most appropriate manner.

The problem with managers, and with external pager mechanisms in general, is that moving control over the virtual memory subsystem out of the kernel and into a user-level server adds to the latency with which page faults can be serviced [40]. In the best case, when the required page is already resident, it is necessary to call and return across the user-level/kernel boundary three times. Each time this occurs, certain registers must be saved and restored and capabilities must be checked to enforce protection. In the worst case, when the page must be retrieved from disk, five boundary crossings are required. Although the cost of the disk I/O far outweighs the cost of the boundary crossings, the crossings represent a significant overhead, consuming time that could be better spent running other loci.

Another source of inefficiency is the duplication of information within managers and the kernel. The most notable example of this involves the virtual address translation tables. Within the current implementation of the Grasshopper kernel, this information is effectively stored in three separate locations. First, the information is held within the three-level hierarchical page tables (termed contexts) used by the memory management hardware. These contexts are typically sparsely populated, making them expensive to maintain on a permanent basis. Therefore, a fixed-size pool of memory is used to cache the most recently used contexts.

Since contexts merely cache recently used address translations, a permanent record is maintained by the kernel using Local Container Descriptors (LCDs) [41]. An LCD represents a mapping from container addresses to physical addresses and contains an entry for every resident page of the container with which it is associated. Each LCD stores address translations very concisely within an extensible hash table. During the resolution of page faults, managers enter address translations into the appropriate LCDs via a system call, and the kernel uses this information to create the corresponding context entries when required. It was originally intended that the presence of an LCD entry for a particular page would stop the kernel from notifying the manager to resolve future faults on the same page. However, this policy prevents managers from using page faults to implement transactions. Therefore, the current implementation passes all faults on a page through to the manager, requiring managers to track the current set of address translations themselves. Although they can obtain this information using a system call to query the appropriate LCDs, it is more time-efficient if a separate hash table is maintained within the manager.
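As an illustration of the kind of record an LCD provides, the sketch below keeps per-container translations in a small chained hash table mapping container page numbers to physical frames. The real LCD is an extensible hash table maintained inside the kernel and populated through a system call; the fixed bucket count, names and layout here are simplifications chosen for brevity.

/*
 * Simplified sketch of a per-container translation record in the spirit of
 * a Local Container Descriptor (LCD): container page number -> physical frame.
 * The real structure is an extensible hash table held by the kernel; this
 * fixed-bucket chained table is illustrative only.
 */
#include <stdint.h>
#include <stdlib.h>

#define LCD_BUCKETS 256

struct lcd_entry {
    uint64_t page;                /* page number within the container     */
    uint64_t frame;               /* physical frame currently holding it  */
    struct lcd_entry *next;
};

struct lcd {
    struct lcd_entry *bucket[LCD_BUCKETS];
};

static unsigned lcd_hash(uint64_t page)
{
    return (unsigned)((page * 0x9E3779B97F4A7C15ULL) >> 56) % LCD_BUCKETS;
}

/* Record (or update) the translation for one resident page. */
int lcd_insert(struct lcd *l, uint64_t page, uint64_t frame)
{
    unsigned h = lcd_hash(page);
    struct lcd_entry *e;

    for (e = l->bucket[h]; e != NULL; e = e->next) {
        if (e->page == page) {        /* already present: update in place */
            e->frame = frame;
            return 0;
        }
    }
    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->page = page;
    e->frame = frame;
    e->next = l->bucket[h];
    l->bucket[h] = e;
    return 0;
}

/* Look up a translation; returns 0 and fills *frame if the page is resident. */
int lcd_lookup(const struct lcd *l, uint64_t page, uint64_t *frame)
{
    const struct lcd_entry *e;

    for (e = l->bucket[lcd_hash(page)]; e != NULL; e = e->next) {
        if (e->page == page) {
            *frame = e->frame;
            return 0;
        }
    }
    return -1;                        /* not resident: the fault goes to the manager */
}

The kernel consults such a record to rebuild context entries on demand, while a manager that needs the same information either queries the kernel or, as noted above, maintains its own shadow table.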
To address some of the protection problems described in Section 2, a Grasshopper container may host many loci simultaneously. When this occurs, the host container forms the basis for the address space of each locus. However, since a locus can privately map regions from other containers into its address space, it is necessary for loci to have separate address spaces. Put another way, Grasshopper does not provide a way for two loci to share the same address space other than by arranging for them to share the same set of mappings. Thus, when context switching from one locus to another, the virtual address space must also be switched to that of the new locus. This is unfortunate, since loci were intended to be extremely lightweight processes, somewhat akin to threads, yet in this implementation they are forced to use a heavyweight context switch much like conventional processes.

The Grasshopper system also suffered from several problems associated with locus synchronisation. Frequently, these were caused by management conflicts between the kernel and user level. For example, when a locus enters a container for the first time, the kernel must lock part of the container to perform a few atomic operations associated with stack creation. However, portions of the container may have been locked by other loci, making it impossible for the invocation to take place atomically. This conflict is caused by a combination of inappropriate abstractions and kernel policy decisions.

Figure 1. (a) Traditional OS model; (b) persistence-oriented OS model.

Whilst the Grasshopper consistency model did not cause any interference at user level, it did impose a causal consistency model on all loci whether they required it or not. It therefore stands as a further example of kernel policy being imposed on applications that may neither need nor want it.
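Returning to the consistency mechanism described earlier, the sketch below shows one way a recovery line can be computed from per-snapshot dependency information. It assumes that each snapshot of a locus records, for every other locus, the minimum snapshot version of that locus on which it depends, and then rolls back any snapshot whose dependencies cannot be satisfied. The array layout, the names and the iteration strategy are illustrative assumptions, not the kernel's actual algorithm or data format.

/*
 * Illustrative recovery-line computation over per-snapshot dependency vectors.
 * dep[i][s][j] = the minimum snapshot version of locus j that snapshot s of
 * locus i depends on (0 if there is no dependency).  This layout and the
 * algorithm are simplifications for illustration only.
 */
#include <stdio.h>

#define NLOCI      3
#define NSNAPSHOTS 4   /* snapshot versions 1..NSNAPSHOTS; 0 = initial state */

static int dep[NLOCI][NSNAPSHOTS + 1][NLOCI];

/*
 * Starting from each locus's most recent snapshot, repeatedly roll back any
 * locus whose chosen snapshot depends on a later snapshot of another locus
 * than the one currently chosen.  The fixed point is the latest mutually
 * consistent global state.
 */
static void recovery_line(const int latest[NLOCI], int chosen[NLOCI])
{
    int changed = 1;

    for (int i = 0; i < NLOCI; i++)
        chosen[i] = latest[i];

    while (changed) {
        changed = 0;
        for (int i = 0; i < NLOCI; i++) {
            for (int j = 0; j < NLOCI; j++) {
                if (chosen[i] > 0 && dep[i][chosen[i]][j] > chosen[j]) {
                    chosen[i]--;        /* this snapshot is not usable */
                    changed = 1;
                }
            }
        }
    }
}

int main(void)
{
    /* Example: snapshot 2 of locus 0 depends on a snapshot of locus 1
     * (version 4) that locus 1 never took, so locus 0 must roll back. */
    dep[0][2][1] = 4;

    int latest[NLOCI] = { 2, 3, 1 };
    int chosen[NLOCI];

    recovery_line(latest, chosen);
    for (int i = 0; i < NLOCI; i++)
        printf("locus %d restored to snapshot %d\n", i, chosen[i]);
    return 0;
}

Whatever the exact kernel algorithm, the relevant point here is that this bookkeeping is performed for every locus, which is precisely the imposition of policy criticised above.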

[1] M. Frans Kaashoek et al. Software prefetching and caching for translation lookaside buffers. OSDI '94, 1994.

[2] Alan Dearle et al. Casper: A Cached Architecture Supporting Persistence. Comput. Syst., 1992.

[3] Robin Fairbairns et al. The Design and Implementation of an Operating System to Support Distributed Multimedia Applications. IEEE J. Sel. Areas Commun., 1996.

[4] Seán Baker. CORBA distributed objects: using ORBIX, 1997.

[5] Partha Dasgupta et al. The Design and Implementation of the Clouds Distributed Operating System. Comput. Syst., 1989.

[6] Partha Dasgupta et al. The Clouds distributed operating system: functional description, implementation details and related work. Proceedings of the 8th International Conference on Distributed Computing Systems, 1988.

[7] Malcolm P. Atkinson. Programming Languages and Databases. VLDB, 1978.

[8] J. Eliot B. Moss et al. Working with Persistent Objects: To Swizzle or Not to Swizzle. IEEE Trans. Software Eng., 1992.

[9] James Leslie Keedy et al. Support for Objects in the MONADS Architecture. POS, 1989.

[10] Frans Henskens et al. An examination of operating system support for persistent object systems. Proceedings of the Twenty-Fifth Hawaii International Conference on System Sciences, 1992.

[11] Malcolm P. Atkinson et al. An orthogonally persistent Java. SIGMOD Record, 1996.

[12] Luca Cardelli et al. Mobile Ambients. FoSSaCS, 1998.

[13] Jochen Liedtke et al. On micro-kernel construction. SOSP, 1995.

[14] Dawson R. Engler et al. Exokernel: an operating system architecture for application-level resource management. SOSP, 1995.

[15] John Rosenberg et al. MONADS-PC: a capability-based workstation to support software engineering, 1985.

[16] E. B. Moss. Nested Transactions: An Approach to Reliable Distributed Computing, 1985.

[17] Rudolf Bayer et al. A database cache for high performance and fast restart in database systems. TODS, 1984.

[18] Sape J. Mullender. Amoeba: high performance distributed computing, 1989.

[19] Claude Kaiser et al. CHORUS Distributed Operating System. Comput. Syst., 1988.

[20] Raymond A. Lorie et al. Physical integrity in a large segmented database. TODS, 1977.

[21] William J. Bolosky et al. Mach: A New Kernel Foundation for UNIX Development. USENIX Summer, 1986.

[22] Ken Thompson et al. The UNIX time-sharing system. CACM, 1974.

[23] John Rosenberg et al. Operating system support for persistent and recoverable computations: latest developments in operating systems, 1996.

[24] Alan Dearle et al. On page-based optimistic process checkpointing. Proceedings of the International Workshop on Object Orientation in Operating Systems, 1995.

[25] Brian N. Bershad et al. User-level interprocess communication for shared memory multiprocessors. TOCS, 1991.

[26] Michael Stonebraker et al. Operating system support for database management. CACM, 1981.

[27] Alan Dearle et al. Toward Ubiquitous Environments for Mobile Users. IEEE Internet Comput., 1998.

[28] Alfred L. Brown et al. Persistent object stores, 1988.

[29] Robbert van Renesse et al. Amoeba: A Distributed Operating System for the 1990s, 1990.

[30] Norman Hardy et al. KeyKOS architecture. OPSR, 1985.

[31] Rasool Jalili et al. Using directed graphs to describe entity dependency in stable distributed persistent stores. Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences, 1995.

[32] Alan Dearle et al. A Flexible Persistent Architecture Permitting Trade-off Between Snapshot and Recovery Times, 1996.

[33] Alan Dearle et al. A Log-Structured Persistent Store, 1996.

[34] Elliott I. Organick et al. The Multics System: An Examination of Its Structure, 1972.

[35] James Leslie Keedy et al. A massive memory supercomputer. Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences, Volume 1: Architecture Track, 1989.

[36] Brian N. Bershad et al. Extensibility, safety and performance in the SPIN operating system. SOSP, 1995.

[37] Richard C. H. Connor et al. The Napier88 Reference Manual, 1997.

[38] Willy Zwaenepoel et al. The performance of consistent checkpointing. Proceedings of the 11th Symposium on Reliable Distributed Systems, 1992.

[39] John Rosenberg et al. The grand unified theory of address spaces. Proceedings of the 5th Workshop on Hot Topics in Operating Systems (HotOS-V), 1995.

[40] Greg Nelson et al. Systems programming in Modula-3, 1991.

[41] Irving L. Traiger et al. A history and evaluation of System R. CACM, 1981.

[42] Evangelos P. Markatos et al. Implementation Issues for the Psyche Multiprocessor Operating System. Comput. Syst., 1989.

[43] Hamid Pirahesh et al. ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging, 1998.

[44] Mahadev Satyanarayanan et al. Lightweight Recoverable Virtual Memory. SOSP, 1993.

[45] Roy H. Campbell et al. Choices (class hierarchical open interface for custom embedded systems). OPSR, 1987.

[46] Andrew S. Tanenbaum et al. A Comparison of Two Distributed Systems: Amoeba and Sprite. Comput. Syst., 1991.

[47] R. Mayes. Trends in Operating Systems towards Dynamic User-level Policy Provision, 1994.

[48] David R. Cheriton. The V Kernel: A Software Base for Distributed Systems. IEEE Software, 1984.

[49] John Rosenberg et al. Grasshopper: An Orthogonally Persistent Operating System. Comput. Syst., 1994.

[50] Partha Dasgupta et al. Linking consistency with object/thread semantics: an approach to robust computation. Proceedings of the 9th International Conference on Distributed Computing Systems, 1989.

[51] Vivek Singhal et al. Texas: An Efficient, Portable Persistent Store. POS, 1992.

[52] David R. Cheriton et al. A caching model of operating system kernel functionality. OPSR, 1995.

[53] John Rosenberg et al. The MONADS Architecture: A Layered View. Workshop on Persistent Objects, 1990.