Noise-Tolerant Explicit Stencil Computations for Nonuniform Process Execution Rates

Next-generation HPC platforms are likely to be characterized by significant, unpredictable nonuniformities in execution time among compute nodes and cores. The resulting load imbalances are expected to arise from a variety of sources, for example manufacturing discrepancies, dynamic power management, runtime component failure, OS jitter, software-mediated resiliency, and TLB/cache performance variations. It is well understood that existing algorithms with frequent points of bulk synchronization will perform relatively poorly in the presence of these sources of process nonuniformity. Thus, recasting classic bulk synchronous algorithms into more asynchronous, coarse-grained parallel forms is a critical area of research for next-generation computing. We propose a class of parallel algorithms for explicit stencil computations that tolerate these nonuniformities by decoupling per-process communication and computation, so that each process can progress asynchronously while maintaining solution correctness. These algorithms are benchmarked with a 1D domain-decomposed (“slabbed”) implementation of the 2D heat equation as a model problem, and are tested in the presence of simulated nonuniform process execution rates. The resulting performance is compared to a classic bulk synchronous implementation of the model problem. Results show that the runtime of this article’s algorithm on a machine with simulated process nonuniformities is 5% to 99% slower than the runtime of its classic counterpart on a machine free of nonuniformities. However, when both algorithms are run on a machine with comparable synthetic process nonuniformities, this article’s algorithm is 1 to 37 times faster than its classic counterpart.
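For orientation, the classic bulk synchronous baseline mentioned above can be sketched in a few lines. This is only an illustrative, serial emulation of a slabbed explicit update for the 2D heat equation, not the noise-tolerant algorithm the article proposes. All names (`step`, `split_into_slabs`, `exchange_ghosts`, `slab_solve`), the one-row ghost layout, and the parameter `r = alpha * dt / h**2` are assumptions introduced for this example.

```python
def step(u, r):
    """One explicit (FTCS) update of the 2D heat equation on the interior of u.

    r = alpha * dt / h**2 (assumed <= 0.25 for stability).
    The first/last rows and columns are held fixed.
    """
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = u[i][j] + r * (
                u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                - 4.0 * u[i][j]
            )
    return new


def split_into_slabs(u, nslabs):
    """Split interior rows into contiguous slabs, each padded with one
    ghost row above and below (assumes nslabs <= number of interior rows)."""
    interior = len(u) - 2
    base, extra = divmod(interior, nslabs)
    slabs, start = [], 1
    for p in range(nslabs):
        n = base + (1 if p < extra else 0)
        slabs.append([row[:] for row in u[start - 1:start + n + 1]])
        start += n
    return slabs


def exchange_ghosts(slabs):
    """Halo exchange between neighboring slabs (stands in for message passing)."""
    for p in range(len(slabs) - 1):
        upper, lower = slabs[p], slabs[p + 1]
        upper[-1] = lower[1][:]   # lower's first owned row -> upper's bottom ghost
        lower[0] = upper[-2][:]   # upper's last owned row -> lower's top ghost


def slab_solve(u, nslabs, r, nsteps):
    """Bulk synchronous loop: every slab exchanges ghosts, then every slab updates."""
    slabs = split_into_slabs(u, nslabs)
    for _ in range(nsteps):
        exchange_ghosts(slabs)                 # global synchronization point
        slabs = [step(s, r) for s in slabs]    # local stencil updates
    out = [u[0][:]]
    for s in slabs:
        out.extend(row[:] for row in s[1:-1])  # keep owned rows, drop ghosts
    out.append(u[-1][:])
    return out
```

In this sketch every slab waits at the ghost exchange on each time step, which is the per-step synchronization point that makes the classic scheme sensitive to a single slow process.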
