Transmuter: Bridging the Efficiency Gap using Memory and Dataflow Reconfiguration

With the end of Dennard scaling and Moore's Law, it is becoming increasingly difficult to build hardware for emerging applications that meets power and performance targets while remaining flexible and programmable for end users. This is particularly true for domains with frequently changing algorithms and applications that involve mixed sparse/dense data structures, such as machine learning and graph analytics. To address this, we present Transmuter, a flexible accelerator that bridges the gap between General-Purpose Processors (GPPs) and Application-Specific Integrated Circuits (ASICs). Transmuter adapts to changing kernel characteristics, such as data reuse and control divergence, through its ability to reconfigure the on-chip memory type, resource sharing, and dataflow at run time with short reconfiguration latency. This is facilitated by a fabric of lightweight cores connected to a network of reconfigurable caches and crossbars. Transmuter addresses a rapidly growing set of algorithms exhibiting dynamic data movement patterns, irregularity, and sparsity, while delivering GPU-like efficiency for traditional dense applications. Finally, to support programmability and ease of adoption, we prototype a software stack composed of low-level runtime routines and a high-level language library called TransPy, which cater to expert programmers and end users, respectively. Our evaluations of Transmuter demonstrate average throughput (energy-efficiency) improvements of 5.0× (18.4×) and 4.2× (4.0×) over a high-end CPU and GPU, respectively, across a diverse set of kernels predominant in graph analytics, scientific computing, and machine learning. Transmuter achieves energy-efficiency gains averaging 3.4× and 2.0× over prior FPGA and CGRA implementations of the same kernels, while remaining on average within 9.3× of state-of-the-art ASICs.
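As an illustration (not taken from the paper) of the irregular, mixed sparse/dense kernels the abstract refers to, the sketch below implements a CSR sparse matrix-vector multiply in plain NumPy. The indirect gather through the column-index array is exactly the kind of low-reuse, divergent access pattern that Transmuter's run-time memory and dataflow reconfiguration is designed to handle:

```python
import numpy as np

# Hypothetical sketch: CSR representation of the 3x3 matrix
#   [[2, 0, 0],
#    [0, 0, 3],
#    [0, 1, 0]]
data    = np.array([2.0, 3.0, 1.0])   # non-zero values
indices = np.array([0,   2,   1])     # column index of each non-zero
indptr  = np.array([0, 1, 2, 3])      # row i's non-zeros: data[indptr[i]:indptr[i+1]]

def spmv(data, indices, indptr, x):
    """y = A @ x for a CSR matrix A. The gather x[indices[lo:hi]] is the
    irregular, data-dependent access pattern typical of sparse kernels."""
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        lo, hi = indptr[i], indptr[i + 1]
        y[i] = data[lo:hi] @ x[indices[lo:hi]]
    return y

x = np.array([1.0, 2.0, 3.0])
print(spmv(data, indices, indptr, x))  # [2. 9. 2.]
```

In contrast, a dense GEMM over the same data would stream operands with high reuse and no divergence; it is this per-kernel variation that motivates reconfiguring between cache-like and scratchpad-like memory modes.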
