Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation

Current GPUs maintain high programmability by abstracting the SIMD nature of the hardware as independent concurrent threads of control with hardware responsible for generating predicate masks to utilize the SIMD hardware for different flows of control. This dynamic masking leads to poor utilization of SIMD resources when the control of different threads in the same SIMD group diverges. Prior research suggests that SIMD groups be formed dynamically by compacting a large number of threads into groups, mitigating the impact of divergence. To maintain hardware efficiency, however, the alignment of a thread to a SIMD lane is fixed, limiting the potential for compaction. We observe that control frequently diverges in a manner that prevents compaction because of the way in which the fixed alignment of threads to lanes is done. This paper presents an in-depth analysis on the causes for ineffective compaction. An important observation is that in many cases, control diverges because of programmatic branches, which do not depend on input data. This behavior, when combined with the default mapping of threads to lanes, severely restricts compaction. We then propose SIMD lane permutation (SLP) as an optimization to expand the applicability of compaction in such cases of lane alignment. SLP seeks to rearrange how threads are mapped to lanes to allow even programmatic branches to be compacted effectively, improving SIMD utilization up to 34% accompanied by a maximum 25% performance boost.

[1]  Zhao Zhang,et al.  A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality , 2000, MICRO 33.

[2]  Xipeng Shen,et al.  Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping , 2010, ICS '10.

[3]  Mattan Erez,et al.  CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[4]  Amitabh Varshney,et al.  High-throughput sequence alignment using Graphics Processing Units , 2007, BMC Bioinformatics.

[5]  Nicolas Brunie,et al.  Simultaneous branch and warp interweaving for sustained GPU performance , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[6]  Thomas E. Anderson,et al.  High speed switch scheduling for local area networks , 1992, ASPLOS V.

[7]  Ken Kennedy,et al.  Conversion of control dependence to data dependence , 1983, POPL '83.

[8]  Tor M. Aamodt,et al.  Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[9]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[10]  Matei Ripeanu,et al.  Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[11]  Nancy M. Amato,et al.  Quantifying the effectiveness of load balance algorithms , 2012, ICS '12.

[12]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[13]  Kevin Skadron,et al.  Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.

[14]  James E. Smith,et al.  Vector instruction set support for conditional operations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[15]  Onur Mutlu,et al.  Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[16]  Atilla Eryilmaz,et al.  Exploring the Throughput Boundaries of Randomized Schedulers in Wireless Networks , 2012, IEEE/ACM Transactions on Networking.

[17]  Donald F. Ferguson,et al.  Microeconomic algorithms for load balancing in distributed computer systems , 1988, [1988] Proceedings. The 8th International Conference on Distributed.

[18]  Norman P. Jouppi,et al.  Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[19]  B. Ramakrishna Rau,et al.  Pseudo-randomly interleaved memory , 1991, ISCA '91.

[20]  W. Dally,et al.  Efficient conditional operations for data-parallel architectures , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.

[21]  Steven S. Muchnick,et al.  Advanced Compiler Design and Implementation , 1997 .

[22]  Richard M. Russell,et al.  The CRAY-1 computer system , 1978, CACM.

[23]  Tor M. Aamodt,et al.  Thread block compaction for efficient SIMT control flow , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[24]  Lei Liu,et al.  A software memory partition approach for eliminating bank-level interference in multicore systems , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[25]  Ahmed Sameh,et al.  The Illiac IV system , 1972 .

[26]  Sudhakar Yalamanchili,et al.  SIMD re-convergence at thread frontiers , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[27]  Paolo Giaccone,et al.  Efficient Randomized Algorithms for Input-Queued Switch Scheduling , 2002, IEEE Micro.

[28]  Wei-Fen Lin,et al.  Reducing DRAM latencies with an integrated memory hierarchy design , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[29]  Sudhakar Yalamanchili,et al.  Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[30]  Mattan Erez,et al.  The dual-path execution model for efficient GPU control flow , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[31]  P. J. Narayanan,et al.  Accelerating Large Graph Algorithms on the GPU Using CUDA , 2007, HiPC.

[32]  Sudhakar Yalamanchili,et al.  Characterization and Transformation of Unstructured Control Flow in GPU Applications , 2011 .

[33]  Kevin Skadron,et al.  Increasing memory miss tolerance for SIMD cores , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.