Recoverable Distributed Shared Memory Under Sequential and Relaxed Consistency.

Abstract: Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attractive platform for executing parallel scientific applications. Checkpointing and rollback techniques can be used in such a system to allow the computation to progress in spite of the temporary failure of one or more processing nodes. The complexity and overhead inherent in traditional message-passing checkpointing techniques can be reduced by taking advantage of specific properties of DSM. In this paper we show that a correctly designed DSM system needs to consider only a subset of message-passing dependencies for correct rollback. We describe a passive server model of DSM computation that loosens dependency restrictions by treating the events involved in interactions between nodes as atomic. An ownership timestamp scheme is used to eliminate many of the dependencies related to keeping directories consistent. The schemes can be implemented in DSM hardware by simply redesigning the directory at the network interface. Finally, we show that relaxing the memory consistency model and using lazy release consistency makes it possible to relax dependency restrictions further. (AN)