Decoupled Vector-Fetch Architecture with a Scalarizing Compiler

Author(s): Lee, Yunsup
Advisor(s): Asanovic, Krste

Abstract: As we approach the end of conventional technology scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency. Among the prominent accelerators that have recently become more popular are data-parallel processing units, such as classic vector units, SIMD units, and graphics processing units (GPUs). Surveying a wide range of data-parallel architectures and their parallel programming models and compilers reveals an opportunity to construct a new data-parallel machine that is highly performant and efficient, yet a favorable compiler target that maintains the same level of programmability as the others.

In this thesis, I present the Hwacha decoupled vector-fetch architecture as the basis of a new data-parallel machine. I reason through the design decisions while describing its programming model, microarchitecture, and LLVM-based scalarizing compiler that efficiently maps OpenCL kernels to the architecture. The Hwacha vector unit is implemented in Chisel as an accelerator attached to a RISC-V Rocket control processor within the open-source Rocket Chip SoC generator. Using complete VLSI implementations of Hwacha, including a cache-coherent memory hierarchy in a commercial 28 nm process and simulated LPDDR3 DRAM modules, I quantify the area, performance, and energy consumption of the Hwacha accelerator. These numbers are then validated against an ARM Mali-T628 MP6 GPU, also built in a 28 nm process, using a set of OpenCL microbenchmarks compiled from the same source code with our custom compiler and ARM's stock OpenCL compiler.
