Low overhead high performance runtime monitoring of collective communication

Scalability of parallel applications on clusters and multi-clusters is often limited by communication performance. Message tracing can provide data for understanding bottlenecks and for performance tuning. However, it requires collecting, storing, analyzing, and transferring potentially gigabytes of data. We have designed the EventSpace system for low-overhead, high-performance runtime analysis of collective communication traces. EventSpace separates the perturbation and performance requirements of data collection, analysis, gathering, and visualization. Data collection overhead is low because only a minimal amount of data is recorded, and it is stored temporarily in main memory. The recorded data is either discarded or analyzed on demand using available cluster resources. Analysis is distributed for high performance, and coscheduled with the computation and communication system threads for low perturbation. Analyzed data is gathered using extensible collective communication operations, which can be tuned to trade off performance against monitoring overhead. EventSpace was used for runtime monitoring and analysis of collective communication micro-benchmarks run on clusters, multi-clusters, and multi-clusters with emulated WAN links. Performance data was collected, analyzed, and gathered with 0-3% monitoring overhead.
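The collection strategy described above (record little, keep it in main memory, then discard or analyze on demand) can be illustrated with a minimal sketch. This is not EventSpace's implementation or API: the `TraceBuffer` class, its method names, and the `(op_id, start, end)` record layout are hypothetical, chosen only to show a bounded in-memory buffer whose oldest records are dropped and whose analysis runs only when requested.

```python
from collections import deque


class TraceBuffer:
    """Hypothetical bounded in-memory trace buffer (illustration only)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest record when full,
        # so memory use stays bounded and nothing is written to disk.
        self._records = deque(maxlen=capacity)

    def record(self, op_id, start, end):
        # Store only a small tuple per event to keep collection cheap.
        self._records.append((op_id, start, end))

    def __len__(self):
        return len(self._records)

    def analyze(self):
        """On-demand analysis: mean latency per operation id."""
        totals = {}
        for op_id, start, end in self._records:
            total, count = totals.get(op_id, (0.0, 0))
            totals[op_id] = (total + (end - start), count + 1)
        return {op: total / count for op, (total, count) in totals.items()}
```

In this sketch the cost of `record` is one tuple append, and no analysis work is done unless `analyze` is called, which mirrors the separation between collection and analysis that the abstract describes.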
