Toward Runtime Power Management of Exascale Networks by On/Off Control of Links

Higher-radix networks, such as high-dimensional tori and multi-level directly connected networks, are being adopted for supercomputers, which keep growing in size yet need low network diameter. These networks provide more resources (e.g., links) to deliver good performance across a range of applications. We observe that a sizeable fraction of the links in the interconnect are never used, or are underutilized, during the execution of common parallel applications. To save power, we therefore propose adding hardware support for software-controlled on/off switching of links, managed by adaptive runtime systems. We study the effectiveness of our approach using real applications (NAMD, MILC) and application benchmarks (NAS Parallel Benchmarks, Jacobi), simulated on representative topologies such as a 6-D torus and Dragonfly (e.g., IBM PERCS, Cray Aries). For common applications, our approach can save up to 16% of the total machine's power and energy without any performance penalty.
