Engineering Algorithms for Scalability through Continuous Validation of Performance Expectations

Many libraries in the HPC field use sophisticated algorithms with clear theoretical scalability expectations. However, hardware constraints or programming bugs may render these expectations inaccurate or simply wrong. While algorithm and performance engineers have long advocated systematically combining analytical performance models with practical measurements, we go one step further and show how this comparison can become part of automated testing procedures. The most important applications of our method are initial validation, regression testing, and benchmarking to compare implementation and platform alternatives. Advancing the concept of performance assertions, we verify asymptotic scaling trends rather than precise analytical expressions, which relieves the developer of the burden of specifying and maintaining very fine-grained and potentially non-portable expectations. In this way, scalability validation can be applied continuously throughout the whole development cycle with very little effort. Using MPI and parallel sorting algorithms as examples, we show how our method can help uncover non-obvious limitations of both libraries and underlying platforms.
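To make the idea of checking an asymptotic trend rather than an exact formula concrete, the following is a minimal sketch and not the method described in this paper. It assumes a small, hypothetical set of candidate growth classes and a simple least-squares fit of t ≈ a + b·f(p) to runtimes measured at several process counts p; the names `CANDIDATES`, `best_class`, and `scales_as_expected` are invented for illustration.

```python
import math

# Hypothetical candidate growth classes, ordered from slowest to fastest growth.
# The selection is an illustrative assumption, not taken from the paper.
CANDIDATES = {
    "O(1)": lambda p: 0.0,
    "O(log p)": lambda p: math.log2(p),
    "O(sqrt p)": lambda p: math.sqrt(p),
    "O(p)": lambda p: float(p),
    "O(p log p)": lambda p: p * math.log2(p),
}
ORDER = list(CANDIDATES)  # dicts preserve insertion order


def residual(ps, ts, f):
    """Sum of squared errors of the least-squares fit t ~ a + b * f(p)."""
    xs = [f(p) for p in ps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    b = sxt / sxx if sxx > 0 else 0.0  # constant model has no slope
    a = mean_t - b * mean_x
    return sum((a + b * x - t) ** 2 for x, t in zip(xs, ts))


def best_class(ps, ts):
    """Return the candidate class whose fit has the smallest residual."""
    return min(ORDER, key=lambda name: residual(ps, ts, CANDIDATES[name]))


def scales_as_expected(ps, ts, expected):
    """True if the best-fitting class grows no faster than the expected one."""
    return ORDER.index(best_class(ps, ts)) <= ORDER.index(expected)
```

In a test suite, a check such as `assert scales_as_expected(process_counts, runtimes, "O(log p)")` would then fail only when the measured trend falls into a faster-growing class, so the developer states the expected class and never the constants. A practical tool would also need repeated measurements and some handling of noise, which this sketch leaves out.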
