Paper Information - Checkpointing and Rollback Recovery in Distributed Shared Memory Systems

Checkpointing and Rollback Recovery in Distributed Shared Memory Systems

Abstract: Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensure that a system rolls back to a consistent state. Traditional dependency tracking in distributed shared memory (DSM) systems is expensive because of the high frequency of communication. In this paper we show that, because of information redundancy, not all message-passing dependences need to be considered to roll back to a consistent state in DSM systems, which reduces both the dependency tracking overhead and the potential for rollback propagation. We develop a model of execution where client processes running an application interact atomically with a set of shared-memory server processes on every access to shared data. We show that under this model, dependences are significantly reduced compared with the message-passing model. We use results from simulation with multiprocessor address traces to demonstrate the reduction in dependences.
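The contrast the abstract draws can be illustrated with a toy sketch. This is not the paper's model or algorithm: the trace format, the function names, and in particular the rule that only a read of data last written by a different process creates a dependence are assumptions made for illustration. The sketch compares a conservative view, where every exchange with a shared-memory server makes a client dependent on all earlier clients of that location, with the assumed data-flow view.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical trace entry: (process id, "R" or "W", address).
Access = Tuple[int, str, int]
Dependence = Tuple[int, int]  # (source process, dependent process)


def message_passing_dependences(trace: List[Access]) -> Set[Dependence]:
    """Conservative view: each access is a message exchange with the server
    holding the address, so the accessor depends on every other process
    that touched that address earlier."""
    earlier: Dict[int, Set[int]] = {}
    deps: Set[Dependence] = set()
    for pid, _op, addr in trace:
        seen = earlier.setdefault(addr, set())
        deps.update((other, pid) for other in seen if other != pid)
        seen.add(pid)
    return deps


def data_flow_dependences(trace: List[Access]) -> Set[Dependence]:
    """Assumed reduced view: only a read of a value last written by a
    different process creates a dependence (writer -> reader)."""
    last_writer: Dict[int, int] = {}
    deps: Set[Dependence] = set()
    for pid, op, addr in trace:
        if op == "W":
            last_writer[addr] = pid
        elif addr in last_writer and last_writer[addr] != pid:
            deps.add((last_writer[addr], pid))
    return deps


# A small made-up trace on one shared address.
example_trace: List[Access] = [
    (0, "W", 0x10),
    (1, "W", 0x10),
    (2, "R", 0x10),
    (0, "R", 0x10),
    (1, "R", 0x10),
]

conservative = message_passing_dependences(example_trace)
reduced = data_flow_dependences(example_trace)
print(len(conservative), "dependences tracked conservatively")
print(len(reduced), "dependences under the assumed data-flow rule")
```

On this made-up trace the reduced set is a strict subset of the conservative one, which is the kind of reduction the abstract reports measuring with multiprocessor address traces. The actual rules and measured reductions are those given in the paper itself.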

W. K. Fuchs | Bob Janssens