Paper Information - Decoupled Vector-Fetch Architecture with a Scalarizing Compiler

Decoupled Vector-Fetch Architecture with a Scalarizing Compiler

Author(s): Lee, Yunsup | Advisor(s): Asanovic, Krste

Abstract: As we approach the end of conventional technology scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency. Among the prominent accelerators that have recently become more popular are data-parallel processing units, such as classic vector units, SIMD units, and graphics processing units (GPUs). Surveying a wide range of data-parallel architectures and their parallel programming models and compilers reveals an opportunity to construct a new data-parallel machine that is highly performant and efficient, yet a favorable compiler target that maintains the same level of programmability as the others.

In this thesis, I present the Hwacha decoupled vector-fetch architecture as the basis of a new data-parallel machine. I reason through the design decisions while describing its programming model, microarchitecture, and LLVM-based scalarizing compiler that efficiently maps OpenCL kernels to the architecture. The Hwacha vector unit is implemented in Chisel as an accelerator attached to a RISC-V Rocket control processor within the open-source Rocket Chip SoC generator. Using complete VLSI implementations of Hwacha, including a cache-coherent memory hierarchy in a commercial 28 nm process and simulated LPDDR3 DRAM modules, I quantify the area, performance, and energy consumption of the Hwacha accelerator. These numbers are then validated against an ARM Mali-T628 MP6 GPU, also built in a 28 nm process, using a set of OpenCL microbenchmarks compiled from the same source code with our custom compiler and ARM's stock OpenCL compiler.
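The abstract names a "scalarizing compiler" without spelling out what scalarization involves. As a rough illustration only, and not the thesis's actual LLVM pass, the idea can be sketched as a forward analysis that marks each value in a kernel as "vector" (it may differ between work-items) or "scalar" (it is the same for every work-item and could be computed once). The three-address IR, the `classify` function, and the SAXPY-like kernel below are all hypothetical, made up for this sketch.

```python
# Hypothetical three-address IR: (destination, operation, sources).
# Names that are never defined by an instruction (a, b, x, y) stand for
# kernel arguments and are assumed to be uniform across work-items.
KERNEL = [
    ("i",     "get_global_id", []),          # unique per work-item
    ("scale", "mul",  ["a", "b"]),           # depends only on kernel arguments
    ("xi",    "load", ["x", "i"]),           # address depends on i
    ("t",     "mul",  ["scale", "xi"]),
    ("yi",    "load", ["y", "i"]),
    ("r",     "add",  ["t", "yi"]),
]


def classify(instrs):
    """Label each destination 'vector' or 'scalar'.

    A value is treated as varying if it is produced by get_global_id or if
    any of its sources is varying. This covers straight-line code only; it
    ignores control flow and memory aliasing, which a real compiler would
    have to handle.
    """
    varying = set()
    for dest, op, srcs in instrs:
        if op == "get_global_id" or any(s in varying for s in srcs):
            varying.add(dest)
    return {
        dest: ("vector" if dest in varying else "scalar")
        for dest, _, _ in instrs
    }


if __name__ == "__main__":
    for name, kind in classify(KERNEL).items():
        print(f"{name}: {kind}")
```

In this toy kernel only `scale` is classified as scalar; everything derived from the work-item index stays vector. On a machine with a separate scalar unit, such values could in principle be computed once rather than per element.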

