Scheduling Weakly Consistent C Concurrency for Reconfigurable Hardware

Lock-free algorithms, in which threads synchronise not via coarse-grained mutual exclusion but via fine-grained atomic operations (‘atomics’), have been shown empirically to be the fastest class of multi-threaded algorithms in the realm of conventional processors. This article explores how these algorithms can be compiled from C to reconfigurable hardware via high-level synthesis (HLS). We focus on the scheduling problem, in which software instructions are assigned to hardware clock cycles. We first show that typical HLS scheduling constraints are insufficient to implement atomics, because they permit some instruction reorderings that, though sound in a single-threaded context, demonstrably cause erroneous results when synthesising multi-threaded programs. We then show that correct behaviour can be restored by imposing additional intra-thread constraints among the memory operations. In addition, we show that we can support the pipelining of loops containing atomics by injecting further inter-iteration constraints. We implement our approach in two HLS tools that use constraint-based scheduling: LegUp 4.0 and LegUp 5.1. We extend both tools to support two memory models that are capable of synthesising atomics correctly. The first memory model supports only sequentially consistent (SC) atomics, and the second supports weakly consistent (‘weak’) atomics as defined by the 2011 revision of the C standard. Weak atomics necessitate fewer constraints than SC atomics, but suffice for many multi-threaded algorithms. We confirm, via automatic model-checking, that we correctly implement the semantics in accordance with the C standard. A case study on a circular buffer suggests that, on average, circuits synthesised from programs that schedule atomics correctly can be 6x faster than an existing lock-based implementation of atomics, that weak atomics can yield a further 1.3x speedup, and that pipelining can yield a further 1.3x speedup.
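To make the reordering hazard concrete, the following is a minimal sketch of a single-producer, single-consumer circular buffer written with C11 weak atomics. It is not the code used in the case study; the names `ring_t`, `ring_init`, `ring_push`, `ring_pop` and the capacity `RING_CAPACITY` are hypothetical and chosen only for illustration. The point it illustrates is that the plain store into `slots` and the release store to `tail` touch different addresses, so a scheduler that reasons only about single-threaded dependencies could reorder them, letting a consumer observe the new `tail` before the data is written.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical capacity, assumed to be a power of two so that the
   free-running unsigned counters below wrap around consistently. */
#define RING_CAPACITY 8u
static_assert((RING_CAPACITY & (RING_CAPACITY - 1u)) == 0u,
              "RING_CAPACITY must be a power of two");

typedef struct {
    int slots[RING_CAPACITY]; /* ordinary (non-atomic) payload storage */
    atomic_uint head;         /* next slot to read; written only by the consumer */
    atomic_uint tail;         /* next slot to write; written only by the producer */
} ring_t;

static void ring_init(ring_t *r) {
    atomic_init(&r->head, 0u);
    atomic_init(&r->tail, 0u);
}

/* Producer side: returns false if the buffer is full. */
static bool ring_push(ring_t *r, int value) {
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    /* Acquire pairs with the consumer's release store to head, so the
       slot being reused has already been read by the consumer. */
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_CAPACITY) {
        return false;
    }
    r->slots[tail % RING_CAPACITY] = value;
    /* Release: the slot write above must not be scheduled after this
       store, even though the two accesses use different addresses. */
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}

/* Consumer side: returns false if the buffer is empty. */
static bool ring_pop(ring_t *r, int *out) {
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    /* Acquire pairs with the producer's release store to tail, so the
       slot read below observes the value the producer wrote. */
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail) {
        return false;
    }
    *out = r->slots[head % RING_CAPACITY];
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}
```

Replacing every explicit ordering with `memory_order_seq_cst` would give the SC-atomics variant of the same buffer. The acquire/release version places fewer ordering requirements on the surrounding memory operations, which is the kind of scheduling freedom that weak atomics are meant to expose.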
