Quiet Neighborhoods: Key to Protect Job Performance Predictability

Interference from nearby jobs has recently been identified as the dominant cause of the high performance variability of parallel applications running on High Performance Computing (HPC) systems. HPC systems are typically dynamic: multiple jobs arrive and leave unpredictably and share the system interconnection network simultaneously. In such an environment, contention for network resources causes random stalls in application execution, degrading both application and overall system performance. Eliminating job interactions within their neighborhoods is key to guaranteeing predictable application performance. In this paper we propose the concept of quiet neighborhoods, which significantly reduce job interactions. Quiet neighborhoods are created by the system resource manager in two phases. First, multiple virtual network blocks are defined on top of the physical network resources, based on typical workload distributions. Second, newly arriving jobs are allocated to these virtual blocks based on their size.
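The two phases can be illustrated with a minimal sketch. It is not the paper's implementation: the names `VirtualBlock`, `define_blocks`, and `allocate`, the size classes, and the node shares are all hypothetical, and the placement rule (a job goes only to the smallest size class that accepts it, and otherwise waits) is an assumption made for illustration. The sketch also ignores network topology and simply counts nodes.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualBlock:
    """A virtual network block reserved for jobs up to a given size (hypothetical model)."""
    name: str
    max_job_size: int  # largest job, in nodes, that this block accepts
    capacity: int      # total nodes assigned to this block
    used: int = 0
    jobs: dict = field(default_factory=dict)

    def free(self) -> int:
        return self.capacity - self.used


def define_blocks(total_nodes, size_classes):
    """Phase 1: split the machine into blocks, one per job-size class.

    size_classes is a list of (name, max_job_size, share) tuples; the shares
    stand in for a workload distribution and are assumed to sum to 1.
    """
    blocks = []
    remaining = total_nodes
    for i, (name, max_size, share) in enumerate(size_classes):
        is_last = i == len(size_classes) - 1
        # The last class takes whatever is left, so no node is lost to rounding.
        capacity = remaining if is_last else int(total_nodes * share)
        remaining -= capacity
        blocks.append(VirtualBlock(name, max_size, capacity))
    return sorted(blocks, key=lambda b: b.max_job_size)


def allocate(blocks, job_id, size):
    """Phase 2: place a job in the smallest size class that accepts it.

    Returns the block name, or None if the job must wait (assumed policy:
    no spill-over into blocks reserved for larger jobs).
    """
    for block in blocks:
        if size <= block.max_job_size:
            if size <= block.free():
                block.used += size
                block.jobs[job_id] = size
                return block.name
            return None
    return None


# Hypothetical 1024-node system with three size classes.
blocks = define_blocks(1024, [("small", 16, 0.25),
                              ("medium", 128, 0.25),
                              ("large", 1024, 0.5)])
placements = [allocate(blocks, "j1", 8),
              allocate(blocks, "j2", 100),
              allocate(blocks, "j3", 400),
              allocate(blocks, "j4", 200)]
print(placements)
```

In this sketch the fourth job is not placed, because the block reserved for large jobs has too few free nodes left; a real resource manager would queue it or apply a different fallback policy.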
