Optimal Runtime Tracing of Message Passing Programs

The widespread adoption of distributed computing has accentuated the need for an effective set of support tools to facilitate debugging and monitoring of distributed programs. Unfortunately for distributed programs, this is not a trivial task. Distributed programs are inherently non-deterministic in nature. Two runs of the same program with the same input data do not result in the same execution sequence. Cyclic debugging is one of the most common strategies used in debugging. To allow cyclic debugging, messages are traced for repeatable execution. In this paper we define a race in the context of a message passing program and present a simple proof that it is impossible to have an algorithm which will produce an optimal message trace (least number on messages traced), in general. We then present two algorithms, Algorithm A and Algorithm B. Both the algorithms trace messages at run-time, i.e., when a message is received at a process. Algorithm A does optimal tracing of messages, given the fact that messages are traced at run-time , and no information about the future is available when these decisions are made . Algorithm B improves on the storage requirement and execution time of Algorithm A, and is based on the surprising fact that only (n-1) buffers are required per process for optimal run-time decision making, where n is the number of processes in the system. This algorithm is a big improvement over the algorithm presented in [10], which does optimal tracing only when the races amongst messages are transitive.