A workload-aware mapping approach for data-parallel programs

Much compiler-orientated work on mapping parallel programs to parallel architectures has ignored the issue of external workload. Given that the majority of platforms will not be dedicated to just one task at a time, the impact of other jobs needs to be addressed. As mapping is highly dependent on the underlying machine, a technique that is easily portable across platforms is also desirable. In this paper we develop an approach for predicting the optimal number of threads for a given data-parallel application in the presence of external workload. We achieve 93.7% of the maximum available speedup, which gives an average speedup of 1.66 on 4 cores, a factor of 1.24 better than the OpenMP compiler's default policy. We also develop an alternative cooperative model that minimizes the impact on external workload while still giving an improved average speedup. Finally, we evaluate our approach on a separate 8-core machine, where it achieves an average speedup of 1.33 times over the default policy, demonstrating its portability.
