Leveraging checkpoint/restore to optimize utilization of cloud compute resources

Cloud computing services have varying performance characteristics that the cloud provider often hides from the user, which makes it difficult for users to make operational decisions about when and where to run their jobs. In this paper, we present a series of experiments on a cloud computing platform to understand these characteristics, followed by strategies to optimize utilization on such platforms. Specifically, our experiments were performed on the lowest tier of Amazon Elastic Compute Cloud (EC2) resources, and initial tests measured the performance characteristics of these resources at different locations over time. Testing revealed that certain performance measures could be improved by location choice. For short-duration jobs, CPU performance was nearly constant, but storage performance was more variable. For long-duration jobs, CPU throttling caused a significant penalty. Checkpointing jobs and migrating them between virtual machines avoided this CPU penalty, resulting in significant savings and improved runtime.