Toward a Microarchitecture for Efficient Execution of Irregular Applications

Given the increasing importance of efficient data-intensive computing, we find that modern processor designs are ill suited to the irregular memory access patterns common in these workloads. Applications and algorithms that lack spatial and temporal locality in their memory requests suffer high latency and low effective memory bandwidth due to high cache miss rates. To address the performance penalties inherent in irregular memory accesses, we introduce the GoblinCore-64 (GC64) architecture and a unique memory hierarchy explicitly designed to extract memory performance from irregular access patterns. GC64 provides pressure-driven, hardware-managed concurrency control that minimizes pipeline stalls and lowers the latency of context switches. We also introduce a novel memory coalescing model that improves memory system performance through request aggregation. We evaluated the performance benefits of our approach using a suite of 24 benchmarks; the results show a nearly 50% reduction in memory requests and a performance acceleration of up to 14.6×.
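The core idea behind request aggregation can be illustrated with a minimal software sketch (this is an illustration only, not the GC64 hardware design; the 64-byte block size and function names are assumptions): pending byte-granularity requests that fall within the same memory block are merged so the memory system issues one transaction per distinct block.

```python
BLOCK = 64  # assumed coalescing granularity in bytes

def coalesce(addresses):
    """Collapse a stream of byte addresses into unique block-aligned requests,
    preserving first-arrival order of the blocks."""
    seen = set()
    blocks = []
    for addr in addresses:
        base = addr - (addr % BLOCK)  # align the request down to its block
        if base not in seen:          # only the first touch of a block issues
            seen.add(base)
            blocks.append(base)
    return blocks

# An irregular stream of 8 accesses that touches only 3 distinct blocks:
stream = [0, 8, 72, 16, 80, 130, 24, 136]
print(coalesce(stream))  # [0, 64, 128] -- 8 requests reduced to 3
```

In this toy model the 8-request stream collapses to 3 memory transactions, which is the kind of request reduction the coalescing model targets; a hardware coalescer would additionally bound how long requests may wait to be merged.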
