Monitoring and Management-Support of Distributed Systems

This paper describes a tool for on-line monitoring of distributed systems. The tool is a hybrid monitor, i.e., it consists of a hardware component and a software layer. It presents the interactive user and the local operating system with high-level information about the activities in the host system, together with a performance evaluation of those activities, while causing minimal interference. Special hardware support, consisting of a test and measurement processor (TMP), was designed and has been implemented in the nodes of an experimental multicomputer system. The main function of the TMP is to execute software that monitors the behavior of the local system and measures the performance of both the resident operating system and the application software. The TMP can also be used to execute low-level operating system functions, to manage local resources, and to trigger time-driven events, thereby reducing the overhead of the host operating system. The operations of the TMP are completely transparent to users and impose a minimal overhead, less than 0.1%, on the hardware system. In the experimental system, all the TMPs were connected to a central monitoring station over an independent communication network in order to provide a global view of the monitored system. The central monitoring station displays the resulting information in easy-to-read charts and graphs. Our experience with the TMP shows that it improves the understanding of run-time behavior and supports performance measurements from which qualitative and quantitative assessments of distributed systems can be derived.
