Thermal, power, and co-location aware resource allocation in heterogeneous high performance computing systems

The rapid growth in power consumption of high performance computing (HPC) systems has increased the amount of cooling resources required to operate these facilities reliably. Cooling systems account for a large portion of a facility's total power consumption, which drives up the cost of powering it. In addition, when cores that share resources (e.g., the last-level cache) execute applications at the same time, those applications can experience contention and therefore performance degradation. A holistic approach to HPC facility management, one that intelligently allocates both computing and cooling resources, can maximize the performance of the HPC system by accounting for co-location effects while obeying power consumption and thermal constraints. The performance of the system is quantified as the total reward earned from completing tasks by their individual deadlines. We propose three novel resource allocation techniques to maximize performance under power and thermal constraints when considering co-location effects: (1) a greedy heuristic, (2) a genetic algorithm used in combination with a new local search technique that guarantees the power and thermal constraints are satisfied, and (3) a nonlinear programming based approach (from previous work), adapted to consider co-location effects.
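To make the objective concrete, the following is a minimal illustrative sketch of a reward-maximizing greedy allocation under a power cap with a co-location penalty. It is not the heuristic proposed in this work: the `Task` fields, the linear slowdown model in `colocation_slowdown`, the reward-per-watt ordering, and the single shared node are all assumptions made for illustration, and thermal constraints are omitted entirely.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    reward: float     # reward earned only if the task finishes by its deadline
    deadline: float   # seconds from time zero
    exec_time: float  # seconds when running without contention (assumed known)
    power: float      # watts drawn while running (assumed constant)


def colocation_slowdown(n_colocated: int, penalty: float = 0.1) -> float:
    """Hypothetical linear contention model.

    Every additional task sharing the node stretches all execution
    times by `penalty` (10% by default).
    """
    return 1.0 + penalty * max(0, n_colocated - 1)


def greedy_allocate(tasks, power_cap, penalty=0.1):
    """Greedily pick tasks for one shared node, all starting at time zero.

    Tasks are considered in decreasing reward-per-watt order. A task is
    accepted only if the power cap still holds and every accepted task
    (including the new one) still meets its deadline after the extra
    co-location slowdown is applied.
    """
    chosen = []
    used_power = 0.0
    ranked = sorted(tasks, key=lambda t: t.reward / t.power, reverse=True)
    for task in ranked:
        if used_power + task.power > power_cap:
            continue  # would violate the power constraint
        candidate = chosen + [task]
        slowdown = colocation_slowdown(len(candidate), penalty)
        if all(t.exec_time * slowdown <= t.deadline for t in candidate):
            chosen = candidate
            used_power += task.power
    total_reward = sum(t.reward for t in chosen)
    return chosen, total_reward
```

Because each accepted task slows down its neighbors, the sketch re-checks every deadline on each step rather than only the new task's. A greedy pass of this kind can still miss better combinations, which is one motivation for search-based and optimization-based alternatives.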
