Offline parallel debugging: a case study report

Debugging is difficult; debugging parallel programs at large scale is particularly so. Interactive debugging tools continue to improve in ways that mitigate the difficulties, and the best such systems will continue to be mission-critical. Such tools have their limitations, however. They are often unable to operate across many thousands of cores. Even when they do function correctly, mining and analyzing the right data from the results of thousands of processes can be daunting, and it is not easy to design interfaces that are useful and effective at large scale. One additional challenge goes beyond the functionality of the tools themselves. Leadership-class systems typically operate in a batch mode intended to maximize utilization and throughput. It is generally unrealistic to expect to schedule a large block of time to operate interactively across a substantial fraction of such a system. Even when large-scale interactive sessions are possible, they can be expensive and can impact system access for others.