Smart, adaptive mapping of parallelism in the presence of external workload

Given the wide scale adoption of multi-cores in main stream computing, parallel programs rarely execute in isolation and have to share the platform with other applications that compete for resources. If the external workload is not considered when mapping a program, it leads to a significant drop in performance. This paper describes an automatic approach that combines compile-time knowledge of the program with dynamic runtime workload information to determine the best adaptive mapping of programs to available resources. This approach delivers increased performance for the target application without penalizing the existing workload. This approach is evaluated on NAS and SpecOMP parallel bench-mark programs across a wide range of workload scenarios. On average, our approach achieves performance gain of 1.5× over a state-of-art scheme on a 12 core machine.

[1]  Mary Lou Soffa,et al.  Contention aware execution: online contention detection and response , 2010, CGO '10.

[2]  A. Snavely,et al.  Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000, SIGP.

[3]  Matthew E. Taylor,et al.  Feature selection and policy optimization for distributed instruction placement using reinforcement learning , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[4]  Bruce R. Childers,et al.  Using utility prediction models to dynamically choose program thread counts , 2012, 2012 IEEE International Symposium on Performance Analysis of Systems & Software.

[5]  Thomas F. Wenisch,et al.  DreamWeaver: architectural support for deep sleep , 2012, ASPLOS XVII.

[6]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[7]  Michael F. P. O'Boyle,et al.  Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping , 2009, PLDI '09.

[8]  Dimitrios S. Nikolopoulos,et al.  Online power-performance adaptation of multithreaded programs using hardware event-based prediction , 2006, ICS '06.

[9]  Michael Voss,et al.  ADAPT: Automated De-coupled Adaptive Program Transformation , 2000, Proceedings 2000 International Conference on Parallel Processing.

[10]  Krishna K. Rangan,et al.  Achieving uniform performance and maximizing throughput in the presence of heterogeneity , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[11]  Michael F. P. O'Boyle,et al.  A workload-aware mapping approach for data-parallel programs , 2011, HiPEAC.

[12]  Sally A. McKee,et al.  An approach to resource-aware co-scheduling for CMPs , 2010, ICS '10.

[13]  Arun Raman,et al.  Parallelism orchestration using DoPE: the degree of parallelism executive , 2011, PLDI '11.

[14]  Michael F. P. O'Boyle,et al.  Partitioning streaming parallelism for multi-cores: A machine learning based approach , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[15]  Stijn Eyerman,et al.  Probabilistic job symbiosis modeling for SMT processor scheduling , 2010, ASPLOS XV.

[16]  Hyesoon Kim,et al.  Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[17]  Michael F. P. O'Boyle,et al.  Mapping parallelism to multi-cores: a machine learning based approach , 2009, PPoPP '09.

[18]  Scott A. Mahlke,et al.  Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[19]  Ayal Zaks,et al.  Parcae: a system for flexible parallel execution , 2012, PLDI.