Relative debugging for a highly parallel hybrid computer system

Relative debugging traces software errors by comparing two concurrent executions of a program: one a reference version and the other a faulty version. It is particularly effective when code is migrated from one platform to another, which makes it of significant interest for hybrid computer architectures that combine CPUs with accelerators or coprocessors. In this paper we extend relative debugging to support porting stencil computations to a hybrid computer. We describe a generic data model that allows programmers to examine the global state across different types of applications, including MPI/OpenMP, MPI/OpenACC, and UPC programs. We present case studies using a hybrid version of the 'stellarator' particle simulation DELTA5D on Titan at ORNL, and the UPC version of Shallow Water Equations on Crystal, an internal Cray supercomputer. These case studies used up to 5,120 GPUs and 32,768 CPU cores to illustrate that the debugger is effective and practical.
