Maximizing system utilization via parallelism management for co-located parallel applications

With an increasing number of cores and memory controllers in multiprocessor platforms, co-location of parallel applications is gaining in importance. Key to achieving good performance is allocating the proper number of threads to each co-located application. This paper presents NuPoCo, a framework for automatically managing the parallelism of co-located parallel applications on NUMA multi-socket multi-core systems. NuPoCo maximizes the utilization of CPU cores and memory controllers by dynamically adjusting the number of threads for co-located parallel applications. Evaluated with various scenarios of co-located OpenMP applications on a 64-core AMD and a 72-core Intel machine, NuPoCo reduces the total turnaround time by 10-20% compared to the default Linux scheduler and an existing parallelism management policy that focuses on CPU utilization only.
