A Coarse Grain Reconfigurable Array (CGRA) for Statically Scheduled Data Flow Computing

This paper makes the case for the use of coarse-grained reconfigurable array (CGRA) architectures for the efficient acceleration of the data flow computations used in deep neural network training and inferencing. The paper discusses the problems with other parallel acceleration systems, such as massively parallel processor arrays (MPPAs) and heterogeneous systems based on CUDA and OpenCL, and proposes that CGRAs with autonomous computing features deliver improved performance and computational efficiency. The machine learning compute appliance that Wave Computing is developing executes data flow graphs using multiple clock-less, CGRA-based Systems on Chip (SoCs), each containing 16,000 processing elements (PEs). This paper describes the tools needed for efficient compilation of data flow graphs to the CGRA architecture, and outlines Wave Computing’s WaveFlow software (SW) framework for the online mapping of models from popular workflows like TensorFlow, MXNet and Caffe.

The Parallel Programming Problem

CPUs have not delivered a substantial improvement in the execution performance of a single-threaded C program for the last decade. A typical PC in 2007 contained a CPU running at 3.4 GHz with 2 GB of DRAM. Today, a typical computer has a CPU running at a similar clock frequency and about 8 GB of DRAM. Once processors hit this “clock wall”, multi-core (as well as many-core and MPPA) architectures were adopted to increase the compute performance of systems without increasing the clock speed. One problem with these systems is that they are difficult to program in a way that achieves linear speed-up. The challenge of developing a compiler that exploits the concurrency in a C program and partitions it efficiently across MPPA architectures is non-trivial and remains an open problem. Instead, the programmer must rewrite the program using frameworks like OpenMP for shared memory or the Message Passing Interface (MPI). Refactoring a C program for efficient multi-threaded execution is a non-trivial exercise and requires techniques taught in graduate programming courses. These techniques are suitable for a modest number of processor cores in a multi-core system; however, for many applications they do not scale easily to 100, 1,000, or more cores. The author of this paper first outlined this problem in 2010 [1]. Of course, this has not prevented people from trying (consider Kalray, Epiphany V, Tilera and KnuEdge). Even if it were possible to map computations across a large number of homogeneous cores, the dynamic distribution of memory and the communication between processors would eventually limit the scalability of any compiled solution.
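As a rough illustration of the refactoring burden described above, the sketch below parallelizes a simple C loop with an OpenMP pragma so that its iterations are divided among the cores of a shared-memory machine. The function and array names are illustrative only and are not taken from any particular codebase; even this trivial case exposes the programmer to decisions about scheduling, data sharing and load balance that become far harder at the scale of hundreds or thousands of cores.

    /* Sketch: a C loop annotated with OpenMP so that its iterations are
     * distributed across the cores of a shared-memory multi-core system.
     * Function and array names are illustrative only. */
    #include <omp.h>

    void saxpy(float *y, const float *x, float a, int n)
    {
        /* Each thread is handed a chunk of the iteration space; the
         * achievable speed-up depends on how evenly the work divides
         * across the available cores and on memory bandwidth. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            y[i] = a * x[i] + y[i];
        }
    }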
Acceleration Using Heterogeneous Architectures

Heterogeneous systems promise increased performance from a simpler programming model that enables a sequential C program to make calls to coprocessors that provide parallel speed-up. A control program executes on a CPU and uses a runtime API to initiate parallel execution of threads in an accelerator to speed up certain tasks, such as the Basic Linear Algebra Subprograms (BLAS). CUDA and OpenCL are two widely used runtime APIs that enable heterogeneous computing; OpenCL aims to span all types of accelerator architectures, from GPUs to FPGAs, whereas CUDA targets only Nvidia GPUs. Most current machine learning systems use the heterogeneous computing architecture just described. One assumption is that the transfer of control and data between the CPU and the accelerator is relatively insignificant (i.e., it carries a low overhead relative to the computation speed-up provided by the accelerator), and for some high-throughput applications, such as image processing, this can be true. Accelerators in a heterogeneous system either transfer blocks of data to and from the CPU main memory, or they use a shared-memory model. These solutions are not scalable because the accelerator is always tethered to the control code and main memory on the CPU. Eventually the communication between the CPU and the accelerator (typically over PCIe) becomes the bottleneck that limits scalability. There are new chip-to-chip interconnect proposals to address this, such as NVLink from Nvidia [2] and CCIX [3]. These also provide coherency, enabling the sharing of memory between the CPU and the accelerator.
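The sketch below shows, under the assumption of a CUDA-capable GPU and the cuBLAS library, the offload pattern described in this section: the host program allocates device memory, copies operands across PCIe, invokes a BLAS routine on the accelerator, and copies the result back. The explicit cudaMemcpy calls are exactly the CPU-to-accelerator traffic argued above to be the eventual scalability bottleneck. The function name and matrix sizes are illustrative, and error checking is omitted.

    /* Sketch of the tethered CPU + accelerator offload pattern, assuming a
     * CUDA-capable GPU and the cuBLAS C API. Error checking is omitted for
     * brevity; names and sizes are illustrative only. */
    #include <stddef.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* Compute C = A * B for n x n matrices on the GPU (cuBLAS assumes
     * column-major storage), with explicit transfers over PCIe. */
    void gemm_on_accelerator(const float *A, const float *B, float *C, int n)
    {
        size_t bytes = (size_t)n * n * sizeof(float);
        const float alpha = 1.0f, beta = 0.0f;
        float *dA, *dB, *dC;
        cublasHandle_t handle;

        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);

        /* Host-to-device transfers: the communication that becomes the
         * bottleneck as the workload scales. */
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

        cublasCreate(&handle);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
        cublasDestroy(handle);

        /* Device-to-host transfer of the result. */
        cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);

        cudaFree(dA);
        cudaFree(dB);
        cudaFree(dC);
    }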