How Many Threads to Spawn during Program Multithreading?

Thread-level program parallelization is key to exploiting the hardware parallelism of emerging multi-core systems. Several techniques have been proposed for program multithreading. However, existing techniques do not address two key issues associated with multithreaded execution of a given program: (a) whether multithreaded execution is faster than sequential execution; and (b) how many threads to spawn during program multithreading. In this paper, we address these limitations. Specifically, we propose a novel approach, T-OPT, to determine how many threads to spawn during multithreaded execution of a given program region. This helps avoid both under-subscription and over-subscription of the hardware resources, which in turn facilitates the exploitation of a higher level of thread-level parallelism (TLP) than can be achieved with the state of the art. We show that, from a program dependence standpoint, using more threads than the proposed approach advocates does not yield a higher degree of TLP. We present case studies and results on kernels extracted from open-source codes to demonstrate the efficacy of our techniques on a real machine.
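
To make the thread-count question concrete, the following is a minimal C++ sketch, not the paper's T-OPT algorithm: it merely caps the number of spawned threads at the smaller of the hardware thread count and a dependence-limited degree of parallelism. The function name `choose_thread_count` and the parameter `dependence_limited_dop` are hypothetical placeholders for the bound that a dependence analysis such as T-OPT would supply.

```cpp
// Sketch: pick a thread count for a parallel region by capping it at both the
// hardware thread count and a dependence-limited degree of parallelism (DOP).
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

unsigned choose_thread_count(unsigned dependence_limited_dop) {
    unsigned hw = std::thread::hardware_concurrency();  // may return 0 if unknown
    if (hw == 0) hw = 1;
    // Spawning more threads than either bound cannot expose additional TLP:
    // extra threads either oversubscribe the cores or stall on dependences.
    return std::max(1u, std::min(dependence_limited_dop, hw));
}

int main() {
    // Assume prior analysis bounded the region's parallelism at 8.
    const unsigned n = choose_thread_count(/*dependence_limited_dop=*/8);
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([i] { std::printf("worker %u\n", i); });
    for (auto& t : workers) t.join();
}
```

The point of the cap is the one made in the abstract: beyond the dependence-imposed limit, additional threads add scheduling and synchronization overhead without increasing exploitable TLP.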