HARP: Harnessing inactive threads in many-core processors

SIMT accelerators are equipped with thousands of computational resources. Conventional accelerators, however, fail to fully utilize these resources due to branch and memory divergence. This underutilization manifests as two underlying inefficiencies: pipeline width underutilization and pipeline depth underutilization. Width underutilization occurs when SIMD execution units are not fully occupied due to branch divergence, reducing lane activity and hence SIMD efficiency. Depth underutilization occurs when the pipeline runs out of active threads and is forced to leave pipeline stages idle. This work addresses both inefficiencies by harnessing the inactive threads available to the pipeline. We introduce Harnessing inActive thReads in many-core Processors (or simply HARP) to improve width and depth utilization in accelerators. We show how exploiting inactive yet ready threads can enhance performance. Moreover, we investigate implementation details and study the microarchitectural changes needed to build a HARP-enhanced accelerator. Furthermore, we evaluate HARP under a variety of microarchitectural design points, measure its area overhead, and compare it to conventional alternatives. Under Fermi-like GPUs, we show that HARP provides a 10% speedup on average (maximum of 1.6X) at the cost of 3.5% area overhead. Our analysis shows that HARP performs better under narrower SIMD widths and shorter pipelines.
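The width inefficiency described above is easiest to see in a divergent kernel. Below is a minimal CUDA sketch (our illustration, not code from the paper; the kernel name and data are hypothetical): when threads of the same 32-wide warp take different sides of a data-dependent branch, a conventional SIMT pipeline serializes the two paths and masks off the lanes on the inactive path. Those masked yet ready threads are precisely what HARP proposes to harness.

// Minimal CUDA sketch (illustrative only) of the branch divergence
// that causes SIMD width underutilization. Threads of one 32-wide
// warp take different sides of the branch, so the warp serializes
// both paths and lanes on the inactive path sit idle.
#include <cstdio>

__global__ void divergent_kernel(const int *in, int *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    if (in[tid] % 2 == 0) {
        // Even-valued lanes execute this path while the others idle...
        out[tid] = in[tid] * 2;
    } else {
        // ...then the remaining lanes execute this path, so each half
        // of the warp's SIMD width is wasted in turn.
        out[tid] = in[tid] + 1;
    }
}

int main() {
    const int n = 64;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;  // alternating parity forces divergence

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    divergent_kernel<<<1, n>>>(d_in, d_out, n);  // two warps, both divergent
    cudaDeviceSynchronize();

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("out[0]=%d out[1]=%d\n", h_out[0], h_out[1]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}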
