Unobtrusive power proportionality for HPC frameworks

Building power-proportional High Performance Computing (HPC) clusters out of servers that are not themselves power-proportional is a well-studied problem with the potential for large energy savings. However, a strong emphasis on maintaining cluster uptime discourages system administrators from deploying techniques from prior research that change existing software configurations, modify the existing cluster job management framework, alter user job submission procedures, or fail in unpredictable ways due to frequent server power cycling [3]. We present Hypnos, a meta-system that tackles the challenge of implementing power proportionality unobtrusively in an HPC cluster with an existing job management framework. Hypnos makes no changes to the existing cluster software or network stack, and uses only the standard interfaces exposed by the existing cluster framework to (a) obtain server state and job information, (b) add servers to and remove them from the existing framework's purview, (c) infer the cluster's scheduling logic, and (d) handle reliability challenges that arise when servers fail to run jobs or boot up, or when race conditions develop between Hypnos and the existing cluster scheduler. We evaluated Hypnos by deploying it on a production HPC cluster running the Torque framework [4]. Hypnos achieved a 36% reduction in energy consumption (compared to an optimal of 37.5%) while circumventing over 1500 network and software faults over a 21-day deployment.
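As a rough illustration of steps (a) and (b), the following Python sketch shows one way a meta-system could decide which servers to wake or power down from observed server states and queue length. It is not Hypnos's actual algorithm: the `Node` type, the `plan_power_actions` function, the `spare` headroom parameter, and the assumption that each queued job needs exactly one server are all hypothetical simplifications introduced here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    state: str  # hypothetical states: "busy", "idle", or "off"


def plan_power_actions(nodes, queued_jobs, spare=1):
    """Return (to_wake, to_sleep) lists of node names.

    Assumes each queued job needs one whole node and that `spare`
    idle nodes are kept powered on as headroom (both assumptions).
    """
    idle = [n.name for n in nodes if n.state == "idle"]
    off = [n.name for n in nodes if n.state == "off"]
    wanted_idle = queued_jobs + spare

    if len(idle) < wanted_idle:
        # Not enough idle capacity: wake as many powered-off nodes as needed.
        need = wanted_idle - len(idle)
        return off[:need], []

    # Enough or too much idle capacity: power down only the surplus.
    surplus = len(idle) - wanted_idle
    return [], idle[:surplus]


nodes = [
    Node("n1", "busy"),
    Node("n2", "idle"),
    Node("n3", "idle"),
    Node("n4", "idle"),
    Node("n5", "off"),
]
print(plan_power_actions(nodes, queued_jobs=0))
print(plan_power_actions(nodes, queued_jobs=4))
```

In a real deployment the node states and queue length would come from the existing framework's standard interfaces, and any plan would have to tolerate the failures listed in (d), such as a server that never finishes booting.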

[1]  David E. Culler, et al. Hypnos: Unobtrusive Power Proportionality for HPC frameworks, 2014.

[2]  Mor Harchol-Balter, et al. The case for sleep states in servers, 2011, HotPower '11.

[3]  Luiz André Barroso, et al. The Case for Energy-Proportional Computing, 2007, Computer.

[4]  Lachlan L. H. Andrew, et al. Dynamic Right-Sizing for Power-Proportional Data Centers, 2011, IEEE/ACM Transactions on Networking.

[5]  Garrick Staples, et al. TORQUE resource manager, 2006, SC.

[6]  Minghong Lin, et al. Characterizing the impact of the workload on the value of dynamic resizing in data centers, 2015, Perform. Evaluation.

[7]  Thomas F. Wenisch, et al. PowerNap: eliminating server idle power, 2009, ASPLOS.

[8]  Amar Phanishayee, et al. FAWN: a fast array of wimpy nodes, 2009, SOSP '09.

[9]  John R. Douceur, et al. Cycles, cells and platters: an empirical analysis of hardware failures on a million consumer PCs, 2011, EuroSys '11.

[10]  Jignesh M. Patel, et al. Energy management for MapReduce clusters, 2010, Proc. VLDB Endow.

[11]  Thomas F. Wenisch, et al. DreamWeaver: architectural support for deep sleep, 2012, ASPLOS XVII.

[12]  Alexandru Iosup, et al. The Grid Workloads Archive, 2008, Future Gener. Comput. Syst.

[13]  Kai Wang, et al. Characterizing the impact of the workload on the value of dynamic resizing in data centers, 2012.

[14]  Klara Nahrstedt, et al. Evaluation and Analysis of GreenHDFS: A Self-Adaptive, Energy-Conserving Variant of the Hadoop Distributed File System, 2010, 2010 IEEE Second International Conference on Cloud Computing Technology and Science.

[15]  Adam Wierman, et al. Renewable and cooling aware workload management for sustainable data centers, 2012, SIGMETRICS '12.

[16]  Christoforos E. Kozyrakis, et al. On the energy (in)efficiency of Hadoop clusters, 2010, OPSR.

[17]  David A. Patterson, et al. Computer Architecture: A Quantitative Approach, 1969.

[18]  Yanpei Chen, et al. Energy efficiency for large-scale MapReduce workloads with significant interactive analysis, 2012, EuroSys '12.

[19]  Wolf-Dietrich Weber, et al. Power provisioning for a warehouse-sized computer, 2007, ISCA '07.