Paper Information - Online Checkpointing with Improved Worst-Case Guarantees

In the online checkpointing problem, the task is to continuously maintain a set of k checkpoints that allow rewinding an ongoing computation faster than by a full restart. The only operation allowed is to replace an old checkpoint with the current state. We seek checkpoint placement strategies that minimize rewinding cost, i.e., such that at all times T, when requested to rewind to some time t ≤ T, the number of computation steps that need to be redone to get to t from a checkpoint before t is as small as possible. In particular, we want the closest checkpoint earlier than t to be no farther away from t than q_k times the ideal distance T/(k+1), where q_k is a small constant. Improving on earlier work showing 1 + 1/k ≤ q_k ≤ 2, we show that q_k can be chosen asymptotically less than 2. We present algorithms with asymptotic discrepancy q_k ≤ 1.59 + o(1) valid for all k and q_k ≤ ln 4 + o(1) ≤ 1.39 + o(1) valid when k is a power of two. Experiments indicate the uniform bound p_k ≤ 1.7 for all k. For small k, we show how to use a linear programming approach to compute good checkpointing algorithms. This gives discrepancies of less than 1.55 for all k < 60. We prove the first lower bound that is asymptotically more than 1, namely q_k ≥ 1.30 - o(1). We also show that optimal algorithms yielding the infimum discrepancy exist for all k.
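The quality measure described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm. It only evaluates a given set of checkpoints at time T, assuming that time 0 (a full restart) and the current time T bound the timeline, so that k checkpoints split [0, T] into k + 1 intervals. The function names `discrepancy` and `replace_checkpoint` are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def discrepancy(checkpoints: Sequence[float], T: float) -> float:
    """Largest gap between neighbouring points, relative to the ideal gap T / (k + 1)."""
    k = len(checkpoints)
    if k == 0 or T <= 0:
        raise ValueError("need at least one checkpoint and T > 0")
    # Assumption: time 0 (full restart) and the current time T bound the timeline.
    points = [0.0, *sorted(checkpoints), float(T)]
    if points[1] < 0 or points[-2] > T:
        raise ValueError("checkpoints must lie in [0, T]")
    # Rewinding to any t costs at most the largest gap between neighbouring points.
    max_gap = max(b - a for a, b in zip(points, points[1:]))
    return max_gap / (T / (k + 1))


def replace_checkpoint(checkpoints: Sequence[float], old: float, T: float) -> List[float]:
    """The only allowed operation: discard one old checkpoint and store the state at time T."""
    updated = list(checkpoints)
    updated.remove(old)  # raises ValueError if `old` is not a stored checkpoint
    updated.append(T)
    return sorted(updated)
```

For example, three evenly spaced checkpoints at times 25, 50, and 75 with T = 100 have discrepancy 1, the ideal value. Replacing the checkpoint at 50 with the current state at T = 100 leaves a gap of 50 between 25 and 75, which doubles the discrepancy to 2. An online strategy has to choose which checkpoint to discard so that this ratio stays small at all times.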

Karl Bringmann | Benjamin Doerr | Adrian Neumann | Jakub Sliacan

[1] Karl Bringmann, et al. Online Checkpointing with Improved Worst-Case Guarantees, 2013, ICALP.

[2] Andrea Walther, et al. Online Checkpointing for Parallel Adjoint Computation in PDEs: Application to Goal-Oriented Adaptivity and Flow Control, 2006, Euro-Par.

[3] Erol Gelenbe, et al. On the Optimum Checkpoint Interval, 1979, JACM.

[4] Lauri Ahlroth, et al. Approximately Uniform Online Checkpointing, 2011, COCOON.

[5] Lauri Ahlroth, et al. Approximately Uniform Online Checkpointing with Bounded Memory, 2013, Algorithmica.

[6] L. Alvisi, et al. A Survey of Rollback-Recovery Protocols, 2002.

[7] C. V. Ramamoorthy, et al. Rollback and Recovery Strategies for Computer Programs, 1972, IEEE Transactions on Computers.

[8] Özalp Babaoglu, et al. On the Optimum Checkpoint Selection Problem, 1984, SIAM J. Comput.

[9] Madhu Sudan, et al. On-line algorithms for locating checkpoints, 2005, Algorithmica.

[10] Philipp Stumm, et al. New Algorithms for Optimal Online Checkpointing, 2010, SIAM J. Sci. Comput.

[11] Artur Andrzejak, et al. Reducing Costs of Spot Instances via Checkpointing in the Amazon Elastic Compute Cloud, 2010, 2010 IEEE 3rd International Conference on Cloud Computing.

[12] Madhu Sudan, et al. Online algorithms for locating checkpoints, 1990, STOC '90.

[13] Adam Dunkels, et al. Sensornet Checkpointing: Enabling Repeatability in Testbeds and Realism in Simulations, 2009, EWSN.