Paper Information - Emerald: Graphics Modeling for SoC Systems

Emerald: Graphics Modeling for SoC Systems

Mobile systems-on-chips (SoCs) have become ubiquitous computing platforms, and, in recent years, they have become increasingly heterogeneous and complex. A typical SoC includes CPUs, graphics processor units (GPUs), image processors, video encoders/decoders, AI engines, digital signal processors (DSPs) and 2D engines among others [33], [70], [71]. One of the most significant SoC units in terms of both off-chip memory bandwidth and SoC die area is the GPU. In this paper, we present Emerald, a simulator that builds on existing tools to provide a unified model for graphics and GPGPU applications. Emerald enables OpenGL (v4.5) and OpenGL ES (v3.2) shaders to run on GPGPU-Sim's timing model and is integrated with gem5 and Android to simulate full SoCs. Emerald thus provides a platform for studying system-level SoC interactions while including the impact of graphics. We present two case studies using Emerald. First, we use Emerald's full-system mode to highlight the importance of system-wide interactions by studying and analyzing memory organization and scheduling schemes for SoC systems. Second, we use Emerald's standalone mode to evaluate a novel mechanism for balancing the graphics shading work assigned to each GPU core.

Tor M. Aamodt | Ayub A. Gubran

[1] David Black-Schaffer, et al. A graphics tracing framework for exploring CPU+GPU memory systems, 2017, 2017 IEEE International Symposium on Workload Characterization (IISWC).

[2] Kevin Skadron, et al. A flexible simulation framework for graphics architectures, 2004, Graphics Hardware.

[3] Onur Mutlu, et al. The Locality Descriptor: A Holistic Cross-Layer Abstraction to Express Data Locality In GPUs, 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).

[4] Erik Lindholm, et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture, 2008, IEEE Micro.

[5] Mahmut T. Kandemir, et al. Domain knowledge based energy management in handhelds, 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[6] Mor Harchol-Balter, et al. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior, 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[7] Henk Corporaal, et al. Locality-Aware CTA Clustering for Modern GPUs, 2017, ASPLOS.

[8] Ronald G. Dreslinski, et al. Sources of error in full-system simulation, 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[9] Kevin Kai-Wei Chang, et al. DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators, 2016, ACM Trans. Archit. Code Optim.

[10] David R. Kaeli, et al. Multi2Sim: A simulation framework for CPU-GPU computing, 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[11] Jem Davies. The bifrost GPU architecture and the ARM Mali-G71 GPU, 2016, 2016 IEEE Hot Chips 28 Symposium (HCS).

[12] John Kim, et al. Improving GPGPU resource utilization through alternative thread block scheduling, 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[13] Mattan Erez, et al. A QoS-aware memory controller for dynamically balancing GPU and CPU bandwidth use in an MPSoC, 2012, DAC Design Automation Conference 2012.

[14] Stijn Eyerman, et al. An Evaluation of High-Level Mechanistic Core Models, 2014, ACM Trans. Archit. Code Optim.

[15] Jose-Maria Arnau, et al. TEAPOT: a toolset for evaluating performance, power and image quality on mobile graphics systems, 2013, ICS '13.

[16] Thomas F. Wenisch, et al. Simulating DRAM controllers for future system architecture exploration, 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[17] Mahmut T. Kandemir, et al. VIP: Virtualizing IP chains on handheld platforms, 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[18] David A. Wood, et al. gem5-gpu: A Heterogeneous CPU-GPU Simulator, 2015, IEEE Computer Architecture Letters.

[19] Lei Yang, et al. Temporal Coherence Methods in Real-Time Rendering, 2012, Comput. Graph. Forum.

[20] Mainak Chaudhuri, et al. Improving CPU Performance Through Dynamic GPU Access Throttling in CPU-GPU Heterogeneous Processors, 2017, 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[21] Jose-Maria Arnau, et al. Parallel frame rendering: Trading responsiveness for energy on a mobile GPU, 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[22] Juan L. Aragón, et al. Early Visibility Resolution for Removing Ineffectual Computations in the Graphics Pipeline, 2019, 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[23] Mark Segal, et al. The OpenGL Graphics System: A Specification, 2004.

[24] Andreas Sandberg, et al. NoMali: Simulating a realistic graphics driver stack using a stub GPU, 2016, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[25] Yuan Yao, et al. Aggregate Flow-Based Performance Fairness in CMPs, 2016, ACM Trans. Archit. Code Optim.

[26] Mahmut T. Kandemir, et al. GemDroid: a framework to evaluate mobile platforms, 2014, SIGMETRICS '14.

[27] Rami G. Melhem, et al. Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing, 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[28] Mahmut T. Kandemir, et al. Exploiting Core Criticality for Enhanced GPU Performance, 2016, SIGMETRICS.

[29] Mahmut T. Kandemir, et al. Short-Circuiting Memory Traffic in Handheld Platforms, 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[30] Jose-Maria Arnau, et al. Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization, 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[31] Antonio González, et al. Visibility Rendering Order: Improving Energy Efficiency on Mobile GPUs through Frame Coherence, 2019, IEEE Transactions on Parallel and Distributed Systems.

[32] Onur Mutlu, et al. The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost, 2014, 2014 IEEE 32nd International Conference on Computer Design (ICCD).

[33] Carlos González, et al. ATTILA: a cycle-level execution-driven simulator for modern GPU architectures, 2006, 2006 IEEE International Symposium on Performance Analysis of Systems and Software.

[34] Karthikeyan Sankaralingam, et al. Dark Silicon and the End of Multicore Scaling, 2012, IEEE Micro.

[35] Gu-Yeon Wei, et al. Co-designing accelerators and SoC interfaces using gem5-Aladdin, 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[36] Henry Wong, et al. Analyzing CUDA workloads using a detailed GPU simulator, 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[37] Kevin Kai-Wei Chang, et al. Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems, 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[38] Somayeh Sardashti, et al. The gem5 simulator, 2011, CARN.

[39] Mor Harchol-Balter, et al. ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers, 2010.

[40] Cheol Hong Kim, et al. A dynamic CTA scheduling scheme for massive parallel computing, 2017, Cluster Computing.