A data‐centric framework for debugging highly parallel applications

Contemporary parallel debuggers allow users to control more than one processing thread while supporting the same examination and visualisation operations as sequential debuggers. This approach limits their usefulness for large-scale scientific applications that run across hundreds of thousands of compute cores. First, manually inspecting runtime data to detect errors becomes impractical because the data is too large. Second, expensive but useful debugging operations become infeasible as computational codes grow more complex and involve larger data structures, and as the machines themselves grow larger. This study explores a data‐centric debugging approach that could make parallel debuggers more powerful. It discusses the use of ad hoc debug‐time assertions that allow a user to reason about the state of a parallel computation. These assertions support the verification and validation of program state at runtime as a whole, rather than the state of a single process. Furthermore, the debugger's performance can be improved by exploiting the underlying parallel platform: the available compute cores can execute parallel debugging functions while the program is idling at a breakpoint. We demonstrate the system with several case studies and evaluate the performance of the tool on a 20 000-core Cray XE6. Copyright © 2013 John Wiley & Sons, Ltd.
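To make the idea of an assertion over whole-program state concrete, the following is a minimal sketch and not the tool's actual interface. It assumes each process holds one partition of a distributed array, reduces its partition to a small summary (count, mean, sum of squared deviations), and that the summaries are then merged pairwise so that a predicate can be checked against the global mean and variance. The names `local_summary`, `merge`, and `global_assert`, and the sample data, are hypothetical; the processes are simulated here as plain Python lists.

```python
from functools import reduce


def local_summary(values):
    """Per-process step: reduce one partition to (count, mean, M2),
    where M2 is the sum of squared deviations from the local mean."""
    n = len(values)
    mean = sum(values) / n if n else 0.0
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2


def merge(a, b):
    """Combine two partial summaries without revisiting the raw data
    (standard pairwise update for count, mean, and M2)."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def global_assert(partitions, predicate):
    """Evaluate a predicate on the global (population) mean and variance.
    Each element of `partitions` stands in for the data held by one process."""
    n, mean, m2 = reduce(merge, (local_summary(p) for p in partitions))
    variance = m2 / n if n else 0.0
    return predicate(mean, variance), mean, variance


# Hypothetical data: three "processes", two values each.
partitions = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ok, mean, variance = global_assert(
    partitions, lambda m, v: abs(m - 3.5) < 1e-9 and v < 4.0
)
print(ok, mean, variance)
```

Because `merge` is associative, the per-partition summaries could in principle be combined in a tree across the idle compute cores, which is the kind of parallel debugging function the abstract alludes to; this sketch only shows the sequential combination.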
