An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

In this paper, we propose ParallelXL, an architectural framework for building application-specific parallel accelerators with low manual effort. The framework introduces a task-based computation model with explicit continuation passing to support dynamic parallelism in addition to static parallelism. In contrast, today's high-level design frameworks for accelerators focus on static data-level or thread-level parallelism that can be identified and scheduled at design time. To realize the new computation model, we develop an accelerator architecture that efficiently handles dynamic task generation and scheduling, as well as load balancing through work stealing. The architecture is general enough to support many dynamic parallel constructs, such as fork-join, data-dependent task spawning, and arbitrary nesting and recursion of tasks, as well as static parallel patterns. We also introduce a design methodology, including an architectural template, that makes it easy to create parallel accelerators from high-level descriptions. The proposed framework is studied through an FPGA prototype as well as detailed simulations. Evaluation results show that the framework can generate high-performance accelerators targeting FPGAs for a wide range of parallel algorithms, achieving an average speedup of 4.0x over an eight-core out-of-order processor (24.1x over a single core) while being 11.8x more energy efficient.
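To make the computation model in the abstract concrete, the following is a minimal software sketch of fork-join parallelism expressed as tasks with explicit continuation passing, scheduled over per-worker queues with work stealing. It is an illustration only and not the ParallelXL hardware or its API: the names `Join`, `run_fib`, and `num_workers`, the round-robin simulation of workers, and the choice of Fibonacci as the workload are all assumptions made for this example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Join:
    """Hypothetical successor task: fires once both child results have arrived."""
    cont: Optional[tuple]          # (parent Join, slot) or None for the root
    missing: int = 2               # number of child results still outstanding
    args: list = field(default_factory=lambda: [0, 0])


def run_fib(n, num_workers=4):
    """Compute fib(n) with continuation-passing tasks; return (result, steals).

    Workers are simulated round-robin in a single thread, so the run is
    deterministic. This is a sketch of the scheduling idea, not a model of
    real hardware timing.
    """
    queues = [deque() for _ in range(num_workers)]
    result = []
    steals = 0

    def send(cont, value, worker):
        # Deliver a value to the continuation instead of returning it.
        if cont is None:
            result.append(value)
            return
        join, slot = cont
        join.args[slot] = value
        join.missing -= 1
        if join.missing == 0:
            # All inputs present: the successor becomes a ready task.
            queues[worker].append(("sum", join))

    def execute(task, worker):
        kind, payload = task
        if kind == "fib":
            k, cont = payload
            if k < 2:
                send(cont, k, worker)
            else:
                # Fork: spawn two children that report to a shared successor.
                join = Join(cont=cont)
                queues[worker].append(("fib", (k - 1, (join, 0))))
                queues[worker].append(("fib", (k - 2, (join, 1))))
        else:
            join = payload
            send(join.cont, join.args[0] + join.args[1], worker)

    queues[0].append(("fib", (n, None)))
    while any(queues):
        for w in range(num_workers):
            if queues[w]:
                task = queues[w].pop()            # own work: newest first
            else:
                victim = next((v for v in range(num_workers) if queues[v]), None)
                if victim is None:
                    continue                      # nothing to steal this round
                task = queues[victim].popleft()   # steal the oldest task
                steals += 1
            execute(task, w)
    return result[0], steals
```

Calling `run_fib(10)` returns 55 as the first element of the tuple; the second element counts how many tasks were stolen by idle workers. With `num_workers=1` no steals occur, which shows that load balancing in this sketch comes entirely from the stealing step.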
