Efficient Data Supply for Parallel Heterogeneous Architectures

Decoupling techniques have been proposed to reduce the amount of memory latency exposed to high-performance accelerators as they fetch data. Although decoupled access-execute (DAE) and more recent decoupled data supply approaches offer promising single-threaded performance improvements, little work has considered how to extend them into parallel scenarios. This article explores the opportunities and challenges of designing parallel, high-performance, resource-efficient decoupled data supply systems. We propose Mercury, a parallel decoupled data supply system that utilizes thread-level parallelism for high-throughput data supply with good portability attributes. Additionally, we introduce some microarchitectural improvements for data supply units to efficiently handle long-latency indirect loads.

[1]  Alasdair Rawsthorne,et al.  The effectiveness of decoupling , 1993, ICS '93.

[2]  Donald Yeung,et al.  Design and evaluation of compiler algorithms for pre-execution , 2002, ASPLOS X.

[3]  Margaret Martonosi,et al.  DeSC: Decoupled supply-compute communication management for heterogeneous architectures , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[4]  Craig Zilles,et al.  Execution-based prediction using speculative slices , 2001, ISCA 2001.

[5]  Jack L. Lo,et al.  Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[6]  Sanjay J. Patel,et al.  OUTRIDER: Efficient memory latency tolerance with decoupled strands , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[7]  Tao Chen,et al.  Efficient data supply for hardware accelerators with prefetching and access/execute decoupling , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[8]  Douglas J. Joseph,et al.  Prefetching Using Markov Predictors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[9]  Huiyang Zhou,et al.  Dual-core execution: building a highly scalable single-thread instruction window , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[10]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[11]  Weifeng Zhang,et al.  Accelerating and Adapting Precomputation Threads for Effcient Prefetching , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[12]  Onur Mutlu,et al.  Accelerating Dependent Cache Misses with an Enhanced Memory Controller , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[13]  James E. Smith,et al.  Data Cache Prefetching Using a Global History Buffer , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[14]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[15]  David A. Wood,et al.  WiDGET: Wisconsin decoupled grid execution tiles , 2010, ISCA.

[16]  Karthikeyan Sankaralingam,et al.  Stream-dataflow acceleration , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[17]  Jung Ho Ahn,et al.  CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques , 2011, 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[18]  Eric Rotenberg,et al.  A large, fast instruction window for tolerating cache misses , 2002, ISCA.

[19]  Karthikeyan Sankaralingam,et al.  Efficient execution of memory access phases using dataflow specialization , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[20]  Margaret Martonosi,et al.  Decoupling Data Supply from Computation for Latency-Tolerant Communication in Heterogeneous Architectures , 2017, ACM Trans. Archit. Code Optim..

[21]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[22]  David I. August,et al.  Decoupled software pipelining with the synchronization array , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[23]  Christoforos E. Kozyrakis,et al.  Learning Memory Access Patterns , 2018, ICML.

[24]  Stijn Eyerman,et al.  Restating the Case for Weighted-IPC Metrics to Evaluate Multiprogram Workload Performance , 2014, IEEE Computer Architecture Letters.

[25]  Haitham Akkary,et al.  Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors , 2003, MICRO.

[26]  David Black-Schaffer,et al.  Fix the code. Don't tweak the hardware: A new compiler approach to Voltage-Frequency scaling , 2014, CGO '14.

[27]  Wen-mei W. Hwu,et al.  Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .

[28]  Francisco J. Cazorla,et al.  A Flexible Heterogeneous Multi-Core Architecture , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[29]  Onur Mutlu,et al.  Continuous runahead: Transparent hardware acceleration for memory intensive workloads , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[30]  Alasdair Rawsthorne,et al.  Compiling and Optimizing for Decoupled Architectures , 1995, Proceedings of the IEEE/ACM SC95 Conference.

[31]  Marc Tremblay,et al.  Simultaneous speculative threading: a novel pipeline architecture implemented in sun's rock processor , 2009, ISCA '09.

[32]  Amir Roth,et al.  BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[33]  Mateo Valero,et al.  A decoupled KILO-instruction processor , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[34]  A. Djabelkhir,et al.  Characterization of embedded applications for decoupled processor architecture , 2003, 2003 IEEE International Conference on Communications (Cat. No.03CH37441).

[35]  Thomas F. Wenisch,et al.  Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[36]  K. Sundaramoorthy,et al.  Slipstream processors: improving both performance and fault tolerance , 2000, SIGP.

[37]  Yale N. Patt,et al.  Simultaneous subordinate microthreading (SSMT) , 1999, ISCA.

[38]  Richard E. Kessler,et al.  Evaluating stream buffers as a secondary cache replacement , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[39]  Jan Reineke,et al.  Ascertaining Uncertainty for Efficient Exact Cache Analysis , 2017, CAV.

[40]  Calvin Lin,et al.  Linearizing irregular memory accesses for improved correlated prefetching , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[41]  Mateo Valero,et al.  Toward kilo-instruction processors , 2004, TACO.

[42]  Michael C. Huang,et al.  DRUT: An Efficient Turbo Boost Solution via Load Balancing in Decoupled Look-Ahead Architecture , 2017, 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[43]  Michael C. Huang,et al.  A performance-correctness explicitly-decoupled architecture , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[44]  Haitham Akkary,et al.  Continual flow pipelines , 2004, ASPLOS XI.

[45]  Stijn Eyerman,et al.  An Evaluation of High-Level Mechanistic Core Models , 2014, ACM Trans. Archit. Code Optim..

[46]  Dean M. Tullsen,et al.  Inter-core prefetching for multicore processors using migrating helper threads , 2011, ASPLOS XVI.

[47]  Sally A. McKee,et al.  Hitting the memory wall: implications of the obvious , 1995, CARN.

[48]  Yale N. Patt,et al.  MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[49]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[50]  Wei-Chung Hsu,et al.  Dynamic helper threaded prefetching on the Sun UltraSPARC/spl reg/ CMP processor , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[51]  David W. Binkley,et al.  Program slicing , 2008, 2008 Frontiers of Software Maintenance.

[52]  Lieven Eeckhout,et al.  Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).