Simulation analysis of large-scale computational science experiment workflows

In this dissertation, we have made several contributions to the study of performance and productivity in high-performance computing. This work is motivated by the lack of tools and methodology for performance analysis of whole computational science experiments. The thesis of this work is that such a lack of tools results in lost productivity, and that using simulation to rank potential experiment configurations and otherwise evaluate workflow optimizations can reduce overall experiment completion time. We test the assertions of this thesis in several steps, detailed in the following paragraphs.

We characterize the problem of complexity in planning and analyzing computational science experiments in more depth than previous work. We give examples of computations for which data transfer is on the critical path, which is evidence that the space of variables for optimizing experiment completion time includes at least queue wait time and network bandwidth in addition to compute speed. Through interviews and surveys, we show that many scientists are over-conservative about evaluating these variables because of uncertainty about their potential impact and the cost, in time and allocation, of evaluating workflow configurations. We show in simulation that evaluating a wider set of optimization parameters can cut completion time in half for some workflows, at a cost of minutes of simulation time and no allocation. Because this broad evaluation is not common in current practice, it represents a real loss of productivity, documented in this dissertation.

Addressing the lack of whole-experiment analysis tools, we present a language for describing HPC workflows at a high level, intended to support rapid prototyping and analysis of workflow changes and their interactions with system characteristics. We develop and demonstrate a simulation framework for analyzing computational science experiment workflows that takes our high-level workflow language as input, along with a separate set of system parameters describing the available resources.

We evaluate the accuracy of our simulation model and system parameterization using a set of synthetic experiments run on real systems. We show that the simulation error in predicted completion time for these experiments ranged from approximately 7% mean absolute error to as high as 62%. Because the predictions would most likely be used to rank configurations, we recast the problem from prediction to ranking and show that, with appropriate data, our approach to simulation can rank real systems by completion time correctly in 74% to 79% of the observed cases. This level of accuracy, combined with the ability to evaluate a large set of potential workflow configurations in less than ten minutes of simulation, represents a significant increase in productivity.

Finally, we evaluate the potential of workflow analysis and prediction for larger experiments using simulation. We demonstrate the capability of our analysis tools to identify situations where counter-intuitive actions can reduce time to completion in a grid similar to the TeraGrid. In particular, we show that in such a grid, a workflow with sequentially dependent jobs, such as a single long-running simulation, can potentially cut overall completion time by as much as half by waiting up to three weeks for access to a system with lower queue wait time or higher bandwidth to archival storage. This demonstrates that evaluating options through simulation can reduce completion time on real systems. Simplified, purely illustrative sketches of the completion-time framing, the ranking comparison, and this waiting tradeoff follow below.
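To make the completion-time framing above concrete, the following is a minimal sketch, in Python, of an additive model that ranks candidate system configurations by estimated completion time. It is purely illustrative: the class and parameter names (SystemConfig, Stage, mean_queue_wait_h, and so on) are hypothetical and do not correspond to the workflow language or simulation framework developed in this dissertation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SystemConfig:
    """Hypothetical per-system parameters; every field name is illustrative."""
    name: str
    mean_queue_wait_h: float      # expected queue wait per job submission, in hours
    compute_speedup: float        # relative compute speed (1.0 = baseline system)
    archive_bandwidth_mbs: float  # bandwidth to archival storage, in MB/s
    access_delay_h: float = 0.0   # up-front wait for access to this system, in hours

@dataclass
class Stage:
    """One job in a sequentially dependent workflow."""
    baseline_compute_h: float     # compute time on the baseline system, in hours
    output_gb: float              # data archived after the job completes, in GB

def estimate_completion_h(stages: List[Stage], system: SystemConfig) -> float:
    """Additive estimate: access delay, then queue wait + compute + transfer per stage."""
    total = system.access_delay_h
    for stage in stages:
        compute_h = stage.baseline_compute_h / system.compute_speedup
        transfer_h = (stage.output_gb * 1024) / system.archive_bandwidth_mbs / 3600
        total += system.mean_queue_wait_h + compute_h + transfer_h
    return total

def rank_configurations(stages: List[Stage],
                        candidates: List[SystemConfig]) -> List[SystemConfig]:
    """Order candidate systems by estimated completion time, best first."""
    return sorted(candidates, key=lambda c: estimate_completion_h(stages, c))
```

Even a model this coarse makes the point argued above: queue wait and archival bandwidth can dominate completion time for data-heavy or many-stage workflows, which is why those variables belong in the optimization space alongside compute speed.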
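The recast from point prediction to ranking can be scored in more than one way. One simple possibility, sketched below as an assumption rather than as the metric actually used in this dissertation, is the fraction of system pairs whose predicted ordering by completion time matches the observed ordering; the site names and times are invented for illustration.

```python
from itertools import combinations

def pairwise_rank_agreement(predicted: dict, observed: dict) -> float:
    """Fraction of system pairs ordered the same way by predicted and observed
    completion times. Both arguments map a system name to a time in hours."""
    pairs = list(combinations(predicted.keys(), 2))
    agree = sum(
        (predicted[a] < predicted[b]) == (observed[a] < observed[b])
        for a, b in pairs
    )
    return agree / len(pairs)

# Illustrative values only.
predicted = {"siteA": 40.0, "siteB": 55.0, "siteC": 70.0}
observed  = {"siteA": 44.0, "siteB": 66.0, "siteC": 61.0}
print(pairwise_rank_agreement(predicted, observed))  # 2 of 3 pairs agree -> ~0.67
```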
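Finally, a small back-of-the-envelope calculation illustrates why waiting for a better-suited system can pay off for a sequentially dependent workflow. The numbers below are invented for illustration and are not results from the TeraGrid-like simulations reported in this dissertation; they only show the shape of the tradeoff.

```python
# Purely illustrative numbers for a 30-stage sequentially dependent workflow.
# System A is available immediately; System B requires a three-week wait for
# access but has far shorter per-job queue waits.
stages = 30
compute_days_per_stage = 1.0

# System A: 3-day mean queue wait per stage, no access delay.
completion_a = stages * (3.0 + compute_days_per_stage)          # 120 days

# System B: 0.2-day mean queue wait per stage, 21-day wait for access.
completion_b = 21.0 + stages * (0.2 + compute_days_per_stage)   # 57 days

print(f"A: {completion_a:.0f} days, B: {completion_b:.0f} days, "
      f"savings: {1 - completion_b / completion_a:.0%}")        # ~52% shorter
```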
This dissertation is intended to provide a justification for more attention to be paid to the entire process of computational science with shared resources, not just the traditionally important individual bottlenecks such as compute performance and queue scheduling. Those topics remain important, but analyzing the practice of computational science as a complete system will permit more comprehensive solutions. We have also attempted to provide a blueprint for improved analysis tools that we hope will help guide system design, compute center policy, and even planning on the scale of a funding agency program. Concerns about the productivity effects of proposed system awards and policy emphases should rest on sound empirical backing, and this work is intended to show one approach toward that goal.