How Much SSD Is Useful for Resilience in Supercomputers

We consider the use of non-volatile memories, in the form of burst buffers, for resilience in supercomputers. Their cost and limited lifetime demand effective use and appropriate provisioning. We develop an analytic model for the behavior of workloads on systems with burst buffers, and use it to explore questions of cost-effective provisioning and mission-directed allocation of burst-buffer (SSD) lifetime. First, our results show that system efficiency can be increased by as much as 14% by considering a global perspective (workload mix, job size) for SSD lifetime allocation. Second, with size-based and system-efficiency-based lifetime allocation, large jobs suffer as much as 40% job efficiency loss; job-efficiency-based allocation must increase their allocations by 50% to eliminate this disparity. Finally, further results suggest that underprovisioning SSD lifetime (only 10-20% of the "optimum" as defined by per-job requirements without resource constraints) is sufficient to produce 90% system efficiency at failure rates three times those of current systems.
