Dynamic concurrency throttling on NUMA systems and data migration impacts

Many parallel applications do not scale as the number of threads increases, so running with the maximum number of threads does not always deliver the best performance or energy consumption. Many works have therefore proposed strategies for tuning the number of threads to optimize performance or energy. Since a parallel application may have more than one parallel region, these strategies can either determine a specific number of threads for each parallel region or fix a single number of threads for the whole execution. In the former case, strategies apply Dynamic Concurrency Throttling (DCT), which adapts the number of threads at runtime. However, DCT incurs overheads, such as creating and destroying threads and cache warm-up. These overheads can be further aggravated in Non-Uniform Memory Access (NUMA) systems, where changing the number of threads may incur remote memory accesses or, more importantly, data migration between nodes. Tuning strategies should therefore not only determine the best number of threads locally, for each parallel region, but also account for the impact of applying DCT itself. This work investigates how parallel regions may influence each other when DCT is employed, showing that data migration may represent a considerable overhead. Those overheads affect the strategy’s solution, impacting overall application performance and energy consumption. We demonstrate why many approaches will very likely fail when applied to simulated environments or will hardly reach a near-optimum solution when executed on real hardware.
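The interaction between regions can be illustrated with a toy cost model. The sketch below is not the method or data of this work: the per-region times, the candidate thread counts, and the flat `SWITCH_PENALTY` (a stand-in for thread creation, cache warm-up, and data migration) are all hypothetical values chosen only to show how a per-region optimum can differ from the whole-application optimum.

```python
from itertools import product

# Hypothetical execution time (seconds) of each parallel region
# for each candidate thread count. These numbers are made up.
REGION_TIME = [
    {4: 10.0, 8: 9.0},  # region 0 is slightly faster with 8 threads
    {4: 6.0, 8: 7.0},   # region 1 is slightly faster with 4 threads
    {4: 10.0, 8: 9.0},  # region 2 is slightly faster with 8 threads
]

# Hypothetical flat cost paid whenever two consecutive regions use
# different thread counts (an assumed proxy for DCT overhead).
SWITCH_PENALTY = 3.0


def total_time(plan):
    """Total time of a plan: region times plus one penalty per change."""
    compute = sum(REGION_TIME[i][t] for i, t in enumerate(plan))
    switches = sum(1 for a, b in zip(plan, plan[1:]) if a != b)
    return compute + SWITCH_PENALTY * switches


def local_plan():
    """Pick the fastest thread count for each region in isolation."""
    return tuple(min(times, key=times.get) for times in REGION_TIME)


def global_plan():
    """Exhaustively pick the plan with the lowest total time."""
    candidates = product(*(times.keys() for times in REGION_TIME))
    return min(candidates, key=total_time)


if __name__ == "__main__":
    for name, plan in (("local", local_plan()), ("global", global_plan())):
        print(name, plan, total_time(plan))
```

Under these assumed numbers, the locally tuned plan switches thread counts twice and ends up slower overall than keeping one thread count throughout, even though each region individually runs at its fastest setting. A real NUMA system would not have a flat, symmetric penalty; the cost would depend on where pages reside and how much data migrates.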
