Architecture exploration of recent GPUs to analyze the efficiency of hardware resources

This study analyzes how efficiently recent graphics processing units (GPUs) execute parallel computing applications. We investigate how the additional hardware resources of a recent architecture affect the performance of popular benchmarks relative to a previous architecture. Our simulation results demonstrate that the Pascal GPU architecture improves performance by 273% on average over the older Fermi architecture. To attribute the performance improvement to specific hardware resources, we divide them into two types: computing resources and memory resources. Computing resources have a bigger impact on performance than memory resources for most benchmarks. For Hotspot and B+Tree, an architecture adopting only the enhanced computing resources achieves performance gains similar to those of an architecture adopting both enhanced computing and memory resources. We also evaluate the influence of the number of warp schedulers per streaming multiprocessor (SM) on GPU performance, in relation to barrier waiting time. Based on these analyses, we propose development directions for future generations of GPUs.
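To make the barrier-waiting effect concrete: warps within a thread block that reach a __syncthreads() barrier early must stall until the slowest warp arrives, and the number of warp schedulers per SM determines how often the lagging warp gets issued. The following minimal CUDA sketch reproduces such an imbalance; the kernel name and all launch parameters are illustrative assumptions, not taken from the paper.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel (hypothetical, not from the paper): warps in a block
// reach the barrier at different times because their workload is imbalanced.
// Warps that finish early stall at __syncthreads() until the slowest warp
// arrives; with fewer warp schedulers per SM, the lagging warp is issued
// less frequently, lengthening the barrier waiting time for the whole block.
__global__ void imbalanced_barrier(float *out, int iters_per_warp)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;

    float v = 0.0f;
    // Higher-numbered warps do proportionally more work, so lower-numbered
    // warps arrive at the barrier first and wait.
    for (int i = 0; i < (warp + 1) * iters_per_warp; ++i)
        v += sinf(v + (float)i);

    __syncthreads();  // all warps in the block synchronize here

    out[tid] = v;
}

int main()
{
    const int threads = 256, blocks = 64;
    float *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));
    imbalanced_barrier<<<blocks, threads>>>(d_out, 1024);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

Running a kernel like this in a cycle-level simulator while sweeping the per-SM scheduler count (e.g., GPGPU-Sim's gpgpu_num_sched_per_core option) is one way to expose the barrier waiting time the abstract refers to.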
