RMC: An integrated runtime system for adaptive many-core computing

Many-core computing has emerged as a promising way to meet the rapidly increasing computational needs of areas ranging from embedded to datacenter computing. However, when allocated an excessive number of cores, multithreaded applications may fail to achieve optimal performance and energy efficiency because of contention for software and/or hardware resources. Previous research has proposed adaptive techniques such as thread packing (TP) and dynamic threading (DT), but these often lead to sub-optimal results because they are applied in isolation. To address this problem, we propose RMC, an integrated runtime system for adaptive many-core computing. Guided by runtime information from parallel applications, RMC dynamically adapts their execution by combining the TP and DT techniques. We apply RMC to six PARSEC benchmarks that use representative parallelism models (i.e., fork-join, task, and pipeline). We demonstrate that RMC is easy to use, considerably outperforms the state-of-the-art techniques on three PARSEC benchmarks, and incurs a small overhead on the rest of the benchmarks.
