Monitoring of Distributed Memory Multicomputer Programs

Abstract: Programs for distributed memory parallel machines are generally considered to be much more complex than sequential programs. Monitoring systems that collect runtime information about a program execution often prove a valuable help in gaining insight into the behavior of a parallel program and thus can improve its performance. This report describes in a systematic and comprehensive way the issues involved in the monitoring of parallel programs running on distributed memory systems. It aims to provide a structured general approach to the field of monitoring and a guide for further documentation. First the different approaches to parallel monitoring are presented and the problems encountered are discussed and classified. In the second part, the main existing systems are described to provide the user with a feeling for the possibilities and limitations of real tools.

Acknowledgments: The authors gratefully acknowledge the valuable and constructive comments of T. Bemmerl, M. Heath, P. Worley and the referees that helped to improve the earlier version of this paper.
