Prospects and challenges of virtual machine migration in HPC

The continuous growth of supercomputers is accompanied by an increasing complexity at the intra-node level and in the interconnection topology. Consequently, the whole software stack, from the system software to the applications, has to evolve, e.g., by means of fault tolerance and support for the rising intra-node parallelism. Migration techniques are one means to address these challenges. On the one hand, they facilitate maintenance by enabling the evacuation of individual nodes at runtime, i.e., the implementation of fault avoidance. On the other hand, they enable dynamic load balancing and thus improve the system's efficiency. However, these prospects come with certain challenges. On the process level, migration mechanisms have to resolve so-called residual dependencies on the source node, e.g., on the communication hardware. On the job level, migrations affect the communication topology, which should be addressed by the communication stack, since the optimal communication path between a pair of processes may change after a migration. In this article, we explore migration mechanisms for HPC and discuss their prospects as well as their challenges. Furthermore, we present solutions enabling their efficient usage in this domain. Finally, we evaluate our prototype co-scheduler, which leverages migration for workload optimization.
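As a minimal sketch of the topology challenge described above (not taken from the article), the following C/MPI fragment shows one way an application could refresh its node-local communicator after a migration, so that locality-aware communication can adapt to the new process placement. It assumes MPI-3; the migration_completed() notification is a hypothetical placeholder for whatever signal the migration framework or runtime would provide.

```c
/*
 * Hedged sketch: rebuilding the node-local communicator after a migration
 * so that shared-memory neighbors are detected again. Assumes MPI-3.
 */
#include <mpi.h>
#include <stdbool.h>

static MPI_Comm node_comm = MPI_COMM_NULL;

/* Stub: a real migration framework would signal completed migrations. */
static bool migration_completed(void) { return false; }

/* Split MPI_COMM_WORLD by shared-memory domain, i.e., by node. */
static void rebuild_locality_info(void)
{
    if (node_comm != MPI_COMM_NULL)
        MPI_Comm_free(&node_comm);

    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        0 /* key: keep world-rank order */,
                        MPI_INFO_NULL, &node_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    rebuild_locality_info();

    for (int iter = 0; iter < 100; ++iter) {
        /* ... application work and communication ... */

        /* After a migration, the mapping of ranks to nodes may have
         * changed, so the cached locality information is stale. */
        if (migration_completed())
            rebuild_locality_info();
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

The point of the sketch is only that locality information cached at startup (here, the shared-memory communicator) must be invalidated and recomputed whenever a migration changes the process-to-node mapping; how the runtime delivers that notification is left open.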
