Paper information - Maximizing system utilization via parallelism management for co-located parallel applications

With an increasing number of cores and memory controllers in multiprocessor platforms, co-location of parallel applications is gaining in importance. Key to achieving good performance is allocating the proper number of threads to each co-located application. This paper presents NuPoCo, a framework for automatically managing the parallelism of co-located parallel applications on NUMA multi-socket multi-core systems. NuPoCo maximizes the utilization of CPU cores and memory controllers by dynamically adjusting the number of threads of the co-located applications. Evaluated on various scenarios of co-located OpenMP applications on a 64-core AMD and a 72-core Intel machine, NuPoCo reduces the total turnaround time by 10-20% compared to the default Linux scheduler and an existing parallelism management policy that focuses on CPU utilization only.
