A Hybrid Systolic-Dataflow Architecture for Inductive Matrix Algorithms

Dense linear algebra kernels are critical for wireless applications, and the oncoming proliferation of 5G only amplifies their importance. Due to the inductive nature of many such algorithms, parallelism is difficult to exploit: parallel regions have fine-grain producer/consumer interaction with iteratively changing dependence distance, reuse rate, and memory access patterns. This makes multi-threading impractical due to fine-grain synchronization, and vectorization ineffective due to the non-rectangular iteration domain. As a result, CPUs, DSPs, and GPUs perform an order of magnitude below peak. Our insight is that if the nature of inductive dependences and memory accesses were explicit in the hardware/software interface, then a spatial architecture could efficiently execute parallel code regions. To this end, we first develop a novel execution model, inductive dataflow, where inductive dependence patterns and memory access patterns (streams) are first-order primitives. Second, we develop a hybrid spatial architecture combining systolic and tagged dataflow execution to attain high utilization at low energy and area cost. Finally, we create a scalable design through a novel vector-stream control model, which amortizes control overhead both in time and spatially across architecture lanes. We evaluate our design, REVEL, with a full stack (compiler, ISA, simulator, RTL). Across a suite of linear algebra kernels, REVEL outperforms equally-provisioned DSPs by 4.6× to 37×. Compared to state-of-the-art spatial architectures, REVEL is 3× faster on average. Compared to a set of ASICs, REVEL requires only 2× the power and half the area.
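
To make the term "inductive" concrete, the following minimal sketch shows forward substitution for a lower-triangular system. This kernel is an assumed illustration and is not taken from the REVEL evaluation; the function names `forward_substitution` and `inner_trip_counts` are hypothetical. The inner loop's trip count depends on the outer loop index, so the iteration domain is triangular rather than rectangular, and each outer iteration consumes values produced by earlier ones.

```python
def forward_substitution(L, b):
    """Solve L x = b for a lower-triangular matrix L (list of lists)."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        # The inner trip count equals i, so it grows with every outer
        # iteration: the iteration domain is triangular, not rectangular.
        for j in range(i):
            # Consumes x[j], produced by an earlier outer iteration
            # (fine-grain producer/consumer dependence).
            acc -= L[i][j] * x[j]
        x[i] = acc / L[i][i]
    return x


def inner_trip_counts(n):
    """Inner-loop trip count for each outer iteration of the kernel above."""
    return [i for i in range(n)]


if __name__ == "__main__":
    L = [[2.0, 0.0, 0.0],
         [1.0, 3.0, 0.0],
         [4.0, 5.0, 6.0]]
    b = [2.0, 7.0, 32.0]
    print(forward_substitution(L, b))
    print(inner_trip_counts(3))
```

Because the amount of work and the set of operands change on every outer iteration, a fixed-width vector unit is left partly idle on short rows, and splitting the outer loop across threads would require synchronizing on each `x[j]`.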
