Automatic detection of errors in distributed systems

Debugging in a distributed environment is considerably more complex than debugging in a uniprocessor or sequential environment. The order of events in a distributed environment is not deterministic, and a particular ordering may sometimes produce unexpected errors. Such errors, which depend on the timing of events, are typical of distributed systems. In this paper we propose an approach based on the concept of timing graphs to debug these timing errors. We give conditions that a timing graph must satisfy if the execution from which it was constructed is free of errors. The graphs are constructed from information gathered at run time by a tracing mechanism, and they are analyzed using topological sorting of directed acyclic graphs. We develop the theory behind this approach and present an algorithm that automates the procedure. Because the manual approach to this problem is cumbersome and error-prone, the automatic procedure simplifies finding the cause of an error for some standard error patterns.
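As a rough illustration of the kind of analysis described above, the following Python sketch builds a graph from per-process traces and attempts a topological sort of it. It is not the paper's algorithm: the abstract does not state the conditions a timing graph must satisfy, so the trace format (`(kind, msg_id)` records), the function names `build_timing_graph` and `topological_order`, and the rule "a cycle indicates an inconsistent ordering" are all assumptions made for this example.

```python
from collections import defaultdict, deque


def build_timing_graph(traces):
    """Build a directed graph from per-process traces (assumed format).

    `traces` maps a process name to its ordered list of (kind, msg_id)
    records, where kind is "send" or "recv". Nodes are (process, position).
    Edges: each event precedes the next event on the same process, and
    each send precedes the matching receive.
    """
    edges = defaultdict(set)
    nodes = []
    sends, receives = {}, {}
    for process, events in traces.items():
        previous = None
        for position, (kind, msg_id) in enumerate(events):
            node = (process, position)
            nodes.append(node)
            if previous is not None:
                edges[previous].add(node)  # program order on one process
            previous = node
            if kind == "send":
                sends[msg_id] = node
            elif kind == "recv":
                receives[msg_id] = node
    for msg_id, sender in sends.items():
        if msg_id in receives:
            edges[sender].add(receives[msg_id])  # send before receive
    return nodes, edges


def topological_order(nodes, edges):
    """Return a topological order (Kahn's algorithm), or None on a cycle."""
    indegree = {node: 0 for node in nodes}
    for source in nodes:
        for target in edges.get(source, ()):
            indegree[target] += 1
    ready = deque(node for node in nodes if indegree[node] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for target in edges.get(node, ()):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)
    # If some nodes were never released, the graph contains a cycle.
    return order if len(order) == len(nodes) else None


# Hypothetical traces: P1 sends m1 and later receives m2; P2 does the reverse.
consistent = {
    "P1": [("send", "m1"), ("recv", "m2")],
    "P2": [("recv", "m1"), ("send", "m2")],
}
# Hypothetical traces in which each process receives before it sends.
cyclic = {
    "P1": [("recv", "m2"), ("send", "m1")],
    "P2": [("recv", "m1"), ("send", "m2")],
}

print(topological_order(*build_timing_graph(consistent)))
print(topological_order(*build_timing_graph(cyclic)))  # None: cycle found
```

In this sketch a successful sort yields one event ordering consistent with the trace, while a result of `None` flags a graph that is not acyclic. The checks for specific standard error patterns mentioned in the abstract would need additional conditions that are not given here.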
