Loop Scheduling for Multithreaded Processors

The presence of multiple active threads on the same processor can mask latency by rapid context switching, but it can adversely affect performance due to competition for shared datapath resources. In this paper we present Macro Software Pipelining (MSWP), a loop scheduling technique for multithreaded processors, which is based on the loop distribution transformation for loop pipelining. MSWP constructs loop schedules by partitioning the loop body into tasks and assigning each task to a thread that executes all iterations for that particular task. MSWP is applied top-down on a hierarchical program representation, and utilizes thread-level speculation for maximal exploitation of parallelism. We tested MSWP on a multithreaded architectural model, Coral 2000, using synthetic and SPEC benchmarks. We obtained speedups of up to 30% with respect to highly optimized superblock-based schedules on loops with unpredictable branches, and a speedup of up to 25% on perl, a highly sequential SPEC95 integer benchmark.

[1]  B. Ramakrishna Rau,et al.  Instruction-level parallel processing: History, overview, and perspective , 2005, The Journal of Supercomputing.

[2]  Antonio González,et al.  Speculative multithreaded processors , 1998, ICS '98.

[3]  Josep Torrellas,et al.  Removing architectural bottlenecks to the scalability of speculative parallelization , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.

[4]  Antonia Zhai,et al.  Compiler optimization of scalar value communication between speculative threads , 2002, ASPLOS X.

[5]  David A. Padua,et al.  Advanced compiler optimizations for supercomputers , 1986, CACM.

[6]  Josep Torrellas,et al.  A clustered approach to multithreaded processors , 1998, Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing.

[7]  David A. Padua,et al.  High-Speed Multiprocessors and Compilation Techniques , 1980, IEEE Transactions on Computers.

[8]  Milind Girkar Functional parallelism: theoretical foundations and implementation , 1992 .

[9]  Scott A. Mahlke,et al.  The superblock: An effective technique for VLIW and superscalar compilation , 1993, The Journal of Supercomputing.

[10]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[11]  Utpal Banerjee,et al.  Dependence analysis for supercomputing , 1988, The Kluwer international series in engineering and computer science.

[12]  Daniel M. Lavery,et al.  Modulo Scheduling for Control-Intensive General-Purpose Programs , 1997 .

[13]  Kevin O'Brien,et al.  Single-program speculative multithreading (SPSM) architecture: compiler-assisted fine-grained multithreading , 1995, PACT.

[14]  Haitham Akkary,et al.  A dynamic multithreading processor , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[15]  James P. Laudon,et al.  Architectural and Implementation Tradeoffs for Multiple-Context Processors , 1995 .

[16]  Jenn-Yuan Tsai,et al.  The superthreaded architecture: thread pipelining with run-time data dependence checking and control speculation , 1996, Proceedings of the 1996 Conference on Parallel Architectures and Compilation Technique.

[17]  B J Smith,et al.  A pipelined, shared resource MIMD computer , 1986 .