Combining virtualization, resource characterization, and resource management to enable efficient high-performance compute platforms through intelligent dynamic resource allocation

Fine-grained, intelligent, and dynamic resource (re)allocation can improve the resource utilization and fault tolerance of large-scale HPC systems. We explore the components and enabling technologies for building a system that provides this capability: (1) scalable, fine-grained monitoring and analysis to inform resource allocation decisions; (2) virtualization to enable dynamic reconfiguration; (3) resource management for the combined physical and virtual resources; and (4) orchestration of the allocation, evaluation, and balancing of resources in a dynamic environment. We discuss both general and HPC-centric issues that affect the design of such a system. Finally, we present our prototype system, giving both design details and examples of its application in real-world scenarios.
