Runtime system level fault tolerance for a distributed functional language

Distributed fault tolerance entails detecting errors, confining the damage they cause, recovering from them, and providing continued service on a network of co-operating machines. Functional languages potentially offer benefits for distributed fault tolerance: many computations are pure, and hence have no side effects to be reversed during error recovery. Moreover, functional languages have a high-level runtime system (RTS) in which computations and data are readily manipulated. We propose a new RTS level of fault tolerance for distributed functional languages, and outline a design for its implementation in Glasgow distributed Haskell (GdH). GdH is a small extension to the Haskell language, and the fault tolerance design utilises existing distributed graph reduction mechanisms. The design distinguishes between pure and impure computations: impure, or side-effecting, computations must be recovered using conventional exception-based techniques, but the RTS attempts implicit backward recovery of pure computations.
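The distinction between pure and impure recovery can be illustrated with a minimal sketch in plain Haskell. This is not GdH code and does not reflect the actual RTS mechanism: the names `Node`, `NodeFailure`, `runOn`, `recoverPure` and `runImpure` are hypothetical, and node failure is simulated by a boolean flag. The sketch only shows why a pure computation can be silently re-evaluated elsewhere, whereas a failure of an impure computation is surfaced to the program as an exception value.

```haskell
import Control.Exception (Exception, evaluate, throwIO, try)

-- Hypothetical model of a processing element: it is either up or has failed.
data Node = Node { nodeName :: String, alive :: Bool }

-- Hypothetical exception raised when work is sent to a failed node.
newtype NodeFailure = NodeFailure String deriving Show
instance Exception NodeFailure

-- 'try' specialised to NodeFailure, so no type annotations are needed below.
tryNode :: IO a -> IO (Either NodeFailure a)
tryNode = try

-- Evaluate a pure value on a node; a failed node raises NodeFailure instead.
runOn :: Node -> a -> IO a
runOn node x
  | alive node = evaluate x
  | otherwise  = throwIO (NodeFailure (nodeName node))

-- Backward recovery of a pure computation: it has no side effects to undo,
-- so it is simply re-evaluated on the next node. The caller never sees the
-- fault unless every node has failed.
recoverPure :: [Node] -> a -> IO (Maybe a)
recoverPure [] _ = return Nothing
recoverPure (n : ns) x = do
  r <- tryNode (runOn n x)
  case r of
    Right v -> return (Just v)
    Left _  -> recoverPure ns x

-- An impure computation is not retried implicitly: the failure is returned
-- to the program, which must handle it explicitly.
runImpure :: Node -> IO a -> IO (Either NodeFailure a)
runImpure node act
  | alive node = tryNode act
  | otherwise  = return (Left (NodeFailure (nodeName node)))

main :: IO ()
main = do
  let failed  = Node "pe1" False
      healthy = Node "pe2" True
  -- Pure: the failure of pe1 is masked by re-evaluation on pe2.
  r <- recoverPure [failed, healthy] (sum [1 .. 100 :: Int])
  print r
  -- Impure: the failure of pe1 is reported and handled by the program.
  e <- runImpure failed (putStrLn "side effect")
  case e of
    Left err -> putStrLn ("handled: " ++ show err)
    Right () -> return ()
```

In this sketch the retry policy is a simple walk down a list of nodes; a real RTS would choose the recovery site using its own scheduling and graph reduction machinery.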