Paper information - Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads

We propose a new thread synchronization heuristic called Min-SP/PC. Min-SP/PC handles function calls better than previous algorithms. Many instructions in SPMD programs are identical across threads. Many memory accesses are either uniform or affine across threads.

Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same processing unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is a recently proposed architecture that shares instruction decoding and execution between threads running the same program in an SMT processor, thereby generalizing the approach followed by Graphics Processing Units to general-purpose processors. In this paper we propose new ways to expose redundancies in the MMT execution model. First, we propose and evaluate a new thread reconvergence heuristic that handles function calls better than previous approaches. Our heuristic inspects only the program counter and the stack frame to reconverge threads; hence, it is amenable to efficient and inexpensive hardware implementation. Second, we demonstrate that this heuristic reveals substantial regularity in inter-thread memory access patterns. We validate our results on data-parallel applications from the PARSEC and SPLASH suites. Our new reconvergence heuristic increases the throughput of our MMT model by 7% compared with a previous and substantially more complex approach due to Long et al. Moreover, it gives us an effective way to increase regularity in memory accesses. We have observed that over 70% of simultaneous memory accesses are either the same for all threads or affine expressions of the thread identifier. This observation motivates the design of newly proposed hardware that benefits from regularity in inter-thread memory accesses.
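The two ideas in the abstract can be illustrated with a small software sketch. It is not the paper's hardware design. The abstract says only that the heuristic inspects the program counter and the stack frame, so the priority rule used here (lowest stack pointer first, then lowest program counter, on a stack that grows downward) is an assumption suggested by the name Min-SP/PC. The names `ThreadState`, `select_min_sp_pc`, and `classify_access` are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class ThreadState(NamedTuple):
    """Hypothetical per-thread state visible to the scheduler."""
    tid: int  # thread identifier
    sp: int   # stack pointer (assumed to grow downward)
    pc: int   # program counter


def select_min_sp_pc(threads: Sequence[ThreadState]) -> List[int]:
    """Return the ids of the threads to run together next.

    Assumed rule: pick the smallest (sp, pc) pair, i.e. the deepest call
    nesting first and, among those threads, the earliest instruction.
    Every thread that shares that pair runs in lockstep.
    """
    if not threads:
        raise ValueError("at least one thread is required")
    leader = min((t.sp, t.pc) for t in threads)
    return [t.tid for t in threads if (t.sp, t.pc) == leader]


def classify_access(addresses: Sequence[int]) -> str:
    """Classify one simultaneous memory access.

    addresses[i] is the address issued by thread i. The access is
    'uniform' if all addresses are equal, 'affine' if they follow
    base + stride * i with a non-zero stride, and 'irregular' otherwise.
    """
    if not addresses:
        raise ValueError("at least one address is required")
    if len(addresses) == 1:
        return "uniform"
    stride = addresses[1] - addresses[0]
    for prev, curr in zip(addresses, addresses[1:]):
        if curr - prev != stride:
            return "irregular"
    return "uniform" if stride == 0 else "affine"


if __name__ == "__main__":
    threads = [
        ThreadState(tid=0, sp=0x7FF0, pc=0x400),  # still in the caller
        ThreadState(tid=1, sp=0x7FE0, pc=0x520),  # in the callee, further ahead
        ThreadState(tid=2, sp=0x7FE0, pc=0x500),  # in the callee, behind
        ThreadState(tid=3, sp=0x7FE0, pc=0x500),  # in the callee, behind
    ]
    print(select_min_sp_pc(threads))
    print(classify_access([0x1000, 0x1004, 0x1008, 0x100C]))
```

In this sketch, threads 2 and 3 are selected first because they sit deepest in the call stack and furthest behind. Running them lets them catch up with thread 1, and the callee finishes before thread 0 resumes in the caller. An access classified as uniform or affine can be described by a base address and a stride rather than by one address per thread, which is the regularity the abstract refers to.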

[1] Kevin Skadron, et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance, 2010, ISCA.

[2] Philip J. Hatcher, et al. Compiling C* programs for a hypercube multicomputer, 1988, PPoPP 1988.

[3] Fernando Magno Quintão Pereira, et al. Data and Instruction Uniformity in Minimal Multi-threading, 2012, 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing.

[4] Yi Yang, et al. A GPGPU compiler for memory optimization and parallelism management, 2010, PLDI '10.

[5] David A. Patterson, et al. Computer Architecture: A Quantitative Approach, 1969.

[6] Frederica Darema, et al. A single-program-multiple-data computational model for EPEX/FORTRAN, 1988, Parallel Comput.

[7] Paola Bonizzoni, et al. An approximation algorithm for the shortest common supersequence problem: an experimental analysis, 2001, SAC.

[8] Jack L. Lo, et al. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[9] Sudhakar Yalamanchili, et al. SIMD re-convergence at thread frontiers, 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[10] Kai Li, et al. The PARSEC benchmark suite: Characterization and architectural implications, 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[11] Yao Zhang, et al. Dynamic Detection of Uniform and Affine Vectors in GPGPU Computations, 2009, Euro-Par Workshops.

[12] Amirali Baniasadi, et al. Performance in GPU Architectures: Potentials and Distances, 2011.

[13] Dongrui Fan, et al. Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors, 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[14] Wen-mei W. Hwu, et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, 2008, PPoPP.

[15] José González, et al. Thread fusion, 2008, Proceeding of the 13th international symposium on Low power electronics and design (ISLPED '08).