Modeling and Analysis of Availability of Datacenter Power Infrastructure

Realizing highly available datacenter power infrastructure is an extremely expensive proposition, with costs more than doubling as we move from three 9’s (Tier-1) to six 9’s (Tier-4) of availability. Existing approaches consider the cost/availability trade-off only for a restricted set of power infrastructure configurations, relying mainly on component redundancy. A number of additional knobs also exist, such as centralized vs. distributed component placement, power-feed interconnect topology, and component capacity over-provisioning, but their impact has been studied only in limited forms. In this paper, we provide a systematic approach to understanding the cost/availability trade-off offered by these configuration parameters as a function of supported IT load. We develop detailed datacenter availability models using Continuous-time Markov Chains and Reliability Block Diagrams to quantify the relative impact of these parameters on availability. Using real-world component availability data to parametrize these models, we offer a number of insights into developing cost-effective yet highly available power infrastructure. As two salient examples, we find that (i) although centralized UPS placement offers high availability, it does so at significant cost, and (ii) distributed server-level UPS placement is much more cost-effective but does not offer meaningful availability for operating the datacenter at full load. Based on these insights, we propose a novel hybrid strategy that combines server-level UPS placement with a rack-level UPS, achieving availability as good as that of existing centralized techniques at just two-thirds of their cost.
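
As a rough illustration of the two modeling tools named above, the following minimal sketch computes the steady-state availability of a single repairable component from a two-state (up/down) Continuous-time Markov Chain and then composes components with series and parallel Reliability Block Diagram rules. It is not the paper's model: the function names, the independence assumption between components, and all MTTF/MTTR values are hypothetical and chosen only to show the arithmetic.

```python
import math

HOURS_PER_YEAR = 8760.0

def steady_state_availability(mttf_hours, mttr_hours):
    """Two-state CTMC (up/down) with failure rate 1/MTTF and repair rate 1/MTTR.

    Its stationary probability of being 'up' is MTTF / (MTTF + MTTR).
    """
    return mttf_hours / (mttf_hours + mttr_hours)

def series(*availabilities):
    """RBD series block: the path works only if every component works."""
    return math.prod(availabilities)

def parallel(*availabilities):
    """RBD parallel block: the path fails only if every component fails.

    Assumes component failures are independent.
    """
    return 1.0 - math.prod(1.0 - a for a in availabilities)

def nines(availability):
    """Number of 9's, e.g. 0.999 -> 3."""
    return -math.log10(1.0 - availability)

def downtime_hours_per_year(availability):
    """Expected unavailable hours in one year."""
    return (1.0 - availability) * HOURS_PER_YEAR

# Hypothetical component data, for illustration only.
ups = steady_state_availability(mttf_hours=50_000, mttr_hours=8)
pdu = steady_state_availability(mttf_hours=200_000, mttr_hours=4)

single_path = series(ups, pdu)                    # one UPS feeding one PDU
redundant_path = series(parallel(ups, ups), pdu)  # two redundant UPSes, one PDU

if __name__ == "__main__":
    for name, a in [("single UPS", single_path), ("redundant UPS", redundant_path)]:
        print(f"{name}: {nines(a):.2f} nines, "
              f"{downtime_hours_per_year(a):.3f} h/year downtime")
```

In this toy setup, duplicating the UPS leaves the single PDU as the dominant contributor to downtime, which is the kind of effect that placement and topology choices, rather than redundancy alone, are meant to address.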
