A Self-Adaptive Network for HPC Clouds: Architecture, Framework, and Implementation

Clouds offer flexible and economically attractive compute and storage solutions for enterprises. However, the effectiveness of cloud computing for high-performance computing (HPC) systems remains questionable. When clouds are deployed on lossless interconnection networks, such as InfiniBand (IB), challenges related to load balancing, low-overhead virtualization, and performance isolation hinder full utilization of the underlying interconnect. Moreover, cloud data centers are highly dynamic environments, which renders the static network reconfigurations typically used in IB systems infeasible. In this paper, we present a framework for a self-adaptive network architecture for HPC clouds based on lossless interconnection networks, demonstrated by means of our implemented IB prototype. Our solution, based on a feedback control and optimization loop, enables the lossless HPC network to adapt dynamically to varying traffic patterns, current resource availability, and workload distributions, in accordance with service provider-defined policies. Furthermore, we present IBAdapt, a simplified rule-based language that service providers use to specify the adaptation strategies applied by the framework. Our self-adaptive IB network prototype is demonstrated using state-of-the-art industry software. The results obtained on a test cluster demonstrate the feasibility and effectiveness of the framework in improving Quality-of-Service compliance in HPC clouds.
