Emerald: Graphics Modeling for SoC Systems

Mobile systems-on-chip (SoCs) have become ubiquitous computing platforms, and, in recent years, they have become increasingly heterogeneous and complex. A typical SoC includes CPUs, graphics processing units (GPUs), image processors, video encoders/decoders, AI engines, digital signal processors (DSPs), and 2D engines, among others [33], [70], [71]. One of the most significant SoC units in terms of both off-chip memory bandwidth and SoC die area is the GPU. In this paper, we present Emerald, a simulator that builds on existing tools to provide a unified model for graphics and GPGPU applications. Emerald enables OpenGL (v4.5) and OpenGL ES (v3.2) shaders to run on GPGPU-Sim's timing model and is integrated with gem5 and Android to simulate full SoCs. Emerald thus provides a platform for studying system-level SoC interactions while including the impact of graphics. We present two case studies using Emerald. First, we use Emerald's full-system mode to highlight the importance of system-wide interactions by analyzing memory organization and scheduling schemes for SoC systems. Second, we use Emerald's standalone mode to evaluate a novel mechanism for balancing the graphics shading work assigned to each GPU core.

[1] David Black-Schaffer, et al. A graphics tracing framework for exploring CPU+GPU memory systems. IEEE International Symposium on Workload Characterization (IISWC), 2017.

[2] Kevin Skadron, et al. A flexible simulation framework for graphics architectures. Graphics Hardware, 2004.

[3] Onur Mutlu, et al. The Locality Descriptor: A Holistic Cross-Layer Abstraction to Express Data Locality In GPUs. ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018.

[4] Erik Lindholm, et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 2008.

[5] Mahmut T. Kandemir, et al. Domain knowledge based energy management in handhelds. IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), 2015.

[6] Mor Harchol-Balter, et al. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010.

[7] Henk Corporaal, et al. Locality-Aware CTA Clustering for Modern GPUs. ASPLOS, 2017.

[8] Ronald G. Dreslinski, et al. Sources of error in full-system simulation. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2014.

[9] Kevin Kai-Wei Chang, et al. DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators. ACM Trans. Archit. Code Optim., 2016.

[10] David R. Kaeli, et al. Multi2Sim: A simulation framework for CPU-GPU computing. 21st International Conference on Parallel Architectures and Compilation Techniques (PACT), 2012.

[11] Jem Davies. The Bifrost GPU architecture and the ARM Mali-G71 GPU. IEEE Hot Chips 28 Symposium (HCS), 2016.

[12] John Kim, et al. Improving GPGPU resource utilization through alternative thread block scheduling. IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 2014.

[13] Mattan Erez, et al. A QoS-aware memory controller for dynamically balancing GPU and CPU bandwidth use in an MPSoC. DAC Design Automation Conference, 2012.

[14] Stijn Eyerman, et al. An Evaluation of High-Level Mechanistic Core Models. ACM Trans. Archit. Code Optim., 2014.

[15] Jose-Maria Arnau, et al. TEAPOT: a toolset for evaluating performance, power and image quality on mobile graphics systems. ICS '13, 2013.

[16] Thomas F. Wenisch, et al. Simulating DRAM controllers for future system architecture exploration. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2014.

[17] Mahmut T. Kandemir, et al. VIP: Virtualizing IP chains on handheld platforms. ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), 2015.

[18] David A. Wood, et al. gem5-gpu: A Heterogeneous CPU-GPU Simulator. IEEE Computer Architecture Letters, 2015.

[19] Lei Yang, et al. Temporal Coherence Methods in Real-Time Rendering. Comput. Graph. Forum, 2012.

[20] Mainak Chaudhuri, et al. Improving CPU Performance Through Dynamic GPU Access Throttling in CPU-GPU Heterogeneous Processors. IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017.

[21] Jose-Maria Arnau, et al. Parallel frame rendering: Trading responsiveness for energy on a mobile GPU. Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013.

[22] Juan L. Aragón, et al. Early Visibility Resolution for Removing Ineffectual Computations in the Graphics Pipeline. IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019.

[23] Mark Segal, et al. The OpenGL Graphics System: A Specification. 2004.

[24] Andreas Sandberg, et al. NoMali: Simulating a realistic graphics driver stack using a stub GPU. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2016.

[25] Yuan Yao, et al. Aggregate Flow-Based Performance Fairness in CMPs. ACM Trans. Archit. Code Optim., 2016.

[26] Mahmut T. Kandemir, et al. GemDroid: a framework to evaluate mobile platforms. SIGMETRICS '14, 2014.

[27] Rami G. Melhem, et al. Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing. IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016.

[28] Mahmut T. Kandemir, et al. Exploiting Core Criticality for Enhanced GPU Performance. SIGMETRICS, 2016.

[29] Mahmut T. Kandemir, et al. Short-Circuiting Memory Traffic in Handheld Platforms. 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014.

[30] Jose-Maria Arnau, et al. Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization. ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), 2014.

[31] Antonio González, et al. Visibility Rendering Order: Improving Energy Efficiency on Mobile GPUs through Frame Coherence. IEEE Transactions on Parallel and Distributed Systems, 2019.

[32] Onur Mutlu, et al. The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost. IEEE 32nd International Conference on Computer Design (ICCD), 2014.

[33] Carlos González, et al. ATTILA: a cycle-level execution-driven simulator for modern GPU architectures. IEEE International Symposium on Performance Analysis of Systems and Software, 2006.

[34] Karthikeyan Sankaralingam, et al. Dark Silicon and the End of Multicore Scaling. IEEE Micro, 2012.

[35] Gu-Yeon Wei, et al. Co-designing accelerators and SoC interfaces using gem5-Aladdin. 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016.

[36] Henry Wong, et al. Analyzing CUDA workloads using a detailed GPU simulator. IEEE International Symposium on Performance Analysis of Systems and Software, 2009.

[37] Kevin Kai-Wei Chang, et al. Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems. 39th Annual International Symposium on Computer Architecture (ISCA), 2012.

[38] Somayeh Sardashti, et al. The gem5 simulator. CARN, 2011.

[39] Mor Harchol-Balter, et al. ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers. 2010.

[40] Cheol Hong Kim, et al. A dynamic CTA scheduling scheme for massive parallel computing. Cluster Computing, 2017.