Exploiting Fine-Grain Ordered Parallelism in Dense Matrix Algorithms

Dense linear algebra kernels are critical for wireless applications, and the oncoming proliferation of 5G only amplifies their importance. Many such matrix algorithms are inductive and exhibit ample fine-grain ordered parallelism: multiple computations flow with fine-grain producer/consumer dependences, and the iteration domain is not easily tileable. Synchronization overheads make multi-core parallelism ineffective, and the non-tileable iterations make the vector-VLIW approach less effective, especially for the typically modest-sized matrices. Because CPUs and DSPs lose an order of magnitude in performance and hardware utilization, costly and inflexible ASICs are often employed in signal processing pipelines. A programmable accelerator with similar performance, power, and area would be highly desirable. We find that fine-grain ordered parallelism can be exploited by supporting: (1) fine-grain stream-based communication/synchronization; (2) inductive data-reuse and memory access patterns; (3) implicit vector-masking for partial vectors; (4) hardware specialization of dataflow criticality. In this work, we propose REVEL as a next-generation DSP architecture. It supports the above features in its ISA and microarchitecture, and further uses a novel vector-stream control paradigm to reduce control overheads. Across a suite of linear algebra kernels, REVEL outperforms equally provisioned DSPs by 4.6x-37x in latency and achieves an 8.3x improvement in performance per mm². It requires only 2.2x more power than ideal ASICs to achieve the same performance, at about 55% of their combined area.
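
To make the terms "inductive" and "partial vectors" concrete, the following is a minimal sketch in plain Python of a right-looking Cholesky factorization, one of the kernels commonly used in wireless pipelines. It is not REVEL's ISA or programming model; the function names `cholesky_inductive` and `masked_chunks` and the vector width `VEC_WIDTH = 4` are assumptions introduced only for illustration. The sketch shows how each outer step produces a value (the square root) that is immediately consumed by the column scaling, which in turn feeds the trailing update, and how the inner trip count shrinks by one per step so that fixed-width vector operations need lane masks.

```python
import math

VEC_WIDTH = 4  # hypothetical vector width, chosen only for illustration


def cholesky_inductive(a):
    """Right-looking Cholesky of a symmetric positive-definite matrix.

    `a` is a list of lists. Returns (L, trip_counts), where L is lower
    triangular with L * L^T == a, and trip_counts[k] is the length of the
    column scaled at outer step k (it shrinks by one each step).
    """
    n = len(a)
    a = [row[:] for row in a]  # work on a copy, leave the input untouched
    trip_counts = []
    for k in range(n):
        # Producer: a single scalar square root on the diagonal.
        a[k][k] = math.sqrt(a[k][k])
        # Consumer of the sqrt, producer for the update: scale column k.
        # The column length n - k - 1 depends on k (inductive domain).
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
        # Consumer of the scaled column: update the trailing lower triangle.
        for j in range(k + 1, n):
            for i in range(j, n):
                a[i][j] -= a[i][k] * a[j][k]
        trip_counts.append(n - k - 1)
    lower = [[a[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]
    return lower, trip_counts


def masked_chunks(length, width=VEC_WIDTH):
    """Split `length` elements into width-wide vector ops.

    Returns one lane mask per vector op; the last mask is partial whenever
    `length` is not a multiple of `width`.
    """
    masks = []
    for start in range(0, length, width):
        active = min(width, length - start)
        masks.append([lane < active for lane in range(width)])
    return masks


if __name__ == "__main__":
    matrix = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
    lower, trips = cholesky_inductive(matrix)
    print(lower)
    print(trips)
    print(masked_chunks(6))
```

Because the trip count changes on every outer iteration, the triangular iteration space cannot be cut into equal rectangular tiles, and most vector operations end in a partially filled vector. In software this requires explicit mask computation as in `masked_chunks`; the abstract's third feature refers to handling such masks implicitly.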
