A practical approach to performance analysis and modeling of large-scale systems

This tutorial presents a practical approach to the performance modeling of large-scale scientific applications on high-performance systems. Its defining feature is the description of a proven modeling approach, developed at Los Alamos, for full-blown scientific codes ranging from a few thousand to over 100,000 lines; the approach has been validated on systems containing thousands of processors. We show how models are constructed and demonstrate how they are used to predict, explain, diagnose, and engineer application performance in existing or future codes and systems. Our approach does not require specific tools and is applicable across commonly used environments. Moreover, because our performance models are parametric, they let the user "experiment ahead" with different system configurations or with different algorithms and coding strategies. Both kinds of experiment will be demonstrated in studies that emphasize the application of these modeling techniques, including verifying system performance, comparing large-scale systems, and examining possible future systems.
