To Cloudify or Not to Cloudify: The Question for a Scientific Data Center

The idea of turning data centers that execute scientific batch jobs into private clouds is as attractive as it is troubling. Cloud platforms may help both in limiting power consumption and in implementing fault tolerance strategies. However, there is also the fear that performance may worsen, and that the electricity required by longer job durations and by the fault tolerance mechanisms may exceed the electricity saved. In this paper, we present consumability analysis, a method for assessing the impact of cloud and fault tolerance tunings on scientific processing systems. The analysis considers performance, consumption, and dependability aspects jointly. The aim is to determine whether, for a given system, there is a setting in which consumption and job failure rate decrease while performance is not affected. Applied to the scientific data center at our University, the analysis allowed us to select a suitable virtual machine configuration, consolidation strategy, and fault tolerance tuning.
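
The acceptance criterion stated above (lower consumption, lower job failure rate, performance not affected) can be illustrated with a minimal sketch. It is not the analysis developed in the paper: the `Setting` fields, the `perf_tolerance` parameter, and every number below are hypothetical and serve only to show how candidate settings would be compared against a baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    """One candidate configuration (all fields are illustrative assumptions)."""
    name: str
    mean_job_duration_s: float  # performance proxy: lower is better
    energy_kwh: float           # consumption over the observation period
    job_failure_rate: float     # fraction of jobs that fail


def acceptable(candidate: Setting, baseline: Setting,
               perf_tolerance: float = 0.0) -> bool:
    """True if the candidate uses less energy and fails less often than the
    baseline, while job duration grows by at most perf_tolerance (a fraction)."""
    return (
        candidate.energy_kwh < baseline.energy_kwh
        and candidate.job_failure_rate < baseline.job_failure_rate
        and candidate.mean_job_duration_s
        <= baseline.mean_job_duration_s * (1.0 + perf_tolerance)
    )


def select_settings(baseline: Setting, candidates: list[Setting],
                    perf_tolerance: float = 0.0) -> list[Setting]:
    """Keep only the candidates that satisfy the criterion."""
    return [c for c in candidates if acceptable(c, baseline, perf_tolerance)]


# Made-up numbers, not measurements from the paper.
baseline = Setting("physical", 1000.0, 100.0, 0.05)
candidates = [
    Setting("vm-consolidated", 1000.0, 80.0, 0.03),
    Setting("vm-heavy-checkpoint", 1200.0, 90.0, 0.01),
    Setting("vm-no-fault-tolerance", 990.0, 85.0, 0.06),
]

for s in select_settings(baseline, candidates):
    print(s.name)
```

With these made-up inputs, only the first candidate passes: the second is rejected because jobs take longer, the third because its failure rate rises. Allowing a nonzero `perf_tolerance` relaxes the performance condition for systems that can accept a bounded slowdown.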
