Efficient network isolation and load balancing in multi-tenant HPC clusters

Abstract: Multi-tenancy promises high utilization of available system resources and helps service providers maintain cost-effective operations. However, multi-tenant high-performance computing (HPC) infrastructures, such as dynamic HPC clouds, bring unique challenges, both in providing performance isolation to the tenants and in achieving efficient load balancing across the network fabric. Each tenant should experience predictable network performance, unaffected by the workload of other tenants. At the same time, it is equally important that the load on the network links is balanced to avoid network saturation. Network saturation can lead to unpredictable application performance and a potential loss of profit for cloud service providers. In this paper, we present two significant extensions to our previously proposed partition-aware fat-tree routing algorithm, pFTree, for InfiniBand-based HPC systems. First, we extend pFTree to incorporate provider-defined partition-wise policies that govern how the nodes in different partitions are allowed to share network resources with each other. Second, we present a weighted version of the pFTree routing algorithm that, in addition to partitions, takes node traffic characteristics into account to balance load across the network links more evenly. A comprehensive evaluation comprising both real-world experiments and simulations confirms the correctness and feasibility of the proposed extensions.
