Debugging Large Scale Applications in a Virtualized Environment

With the advent of petascale machines with hundreds of thousands of processors, debugging parallel applications is becoming increasingly challenging. Beyond the complexity of the debugging techniques needed at such scale, it is often difficult to gain access to these machines for a sufficient period of time, if at all. Some existing parallel debuggers can handle these machines, but they still require the whole machine to be allocated. In this paper, we present an innovative approach to debugging at such extreme scales. By leveraging object-based processor virtualization, our technique enables debugging of executions on as many as a million processors in a simulated environment, using only a relatively small cluster. We describe the obstacles we overcame to achieve this goal within two message-passing programming models: CHARM++ and MPI. We demonstrate the results using real-world applications such as molecular dynamics and cosmological simulation programs.
