Decoupled Vector-Fetch Architecture with a Scalarizing Compiler
[1] Scott A. Mahlke,et al. A comparison of full and partial predicated execution support for ILP processors , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.
[2] Micha Sharir,et al. Structural Analysis: A New Approach to Flow Analysis in Optimizing Compilers , 2015 .
[3] Michael Bedford Taylor,et al. A Landscape of the New Dark Silicon Design Regime , 2013, IEEE Micro.
[4] Christopher Batten,et al. The vector-thread architecture , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004.
[5] David Maier,et al. The Complexity of Some Problems on Subsequences and Supersequences , 1978, JACM.
[6] James Demmel,et al. Precimonious: Tuning assistant for floating-point precision , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[7] Benoît Dupont de Dinechin. Using the SSA-Form in a Code Generator , 2014, CC.
[8] Ken Kennedy,et al. Vector Register Allocation , 1992, IEEE Trans. Computers.
[9] Vikram Bhatt,et al. The GreenDroid Mobile Application Processor: An Architecture for Silicon's Dark Future , 2011, IEEE Micro.
[10] Brian Kingsbury,et al. The T0 Vector Microprocessor , 2011 .
[11] Youngmin Shin,et al. 28nm high-K metal gate heterogeneous quad-core CPUs for high-performance and energy-efficient mobile application processor , 2013, 2013 International SoC Design Conference (ISOCC).
[12] Vikram S. Adve,et al. LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004.
[13] John Wawrzynek,et al. T0: A Single-Chip Vector Microprocessor with Reconfigurable Pipelines , 1996, ESSCIRC '96: Proceedings of the 22nd European Solid-State Circuits Conference.
[14] Hunter Scales,et al. AltiVec Extension to PowerPC Accelerates Media Processing , 2000, IEEE Micro.
[15] William J. Dally,et al. GPUs and the Future of Parallel Computing , 2011, IEEE Micro.
[16] Werner Buchholz. The IBM System/370 Vector Architecture , 1986, IBM Syst. J.
[17] Christopher Batten,et al. Simplified vector-thread architectures for flexible and efficient data-parallel accelerators , 2010 .
[18] David W. Nellans,et al. Flexible software profiling of GPU architectures , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[19] John Wawrzynek,et al. Vector microprocessors , 1998 .
[20] Thomas Ball,et al. What's in a region?: or computing control dependence regions in near-linear time for reducible control flow , 1993, LOPL.
[21] Colin Schmidt,et al. Hwacha Preliminary Evaluation Results, Version 3.8.1 , 2015 .
[22] Wen-mei W. Hwu,et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .
[23] Ken Kennedy,et al. Conversion of control dependence to data dependence , 1983, POPL '83.
[24] Arthur J. Bernstein,et al. Analysis of Programs for Parallel Processing , 1966, IEEE Trans. Electron. Comput..
[25] Fernando Magno Quintão Pereira,et al. Divergence Analysis and Optimizations , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[26] David A. Patterson,et al. Scalable Vector Media-processors for Embedded Systems , 2002 .
[27] Bruce Jacob,et al. DRAMSim2: A Cycle Accurate Memory System Simulator , 2011, IEEE Computer Architecture Letters.
[28] Erik Lindholm,et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.
[29] Mateo Valero,et al. Out-of-order vector architectures , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.
[30] Krste Asanovic,et al. Torrent Architecture Manual , 1997 .
[31] Hiroshi Tamura,et al. FACOM VP-100/200: Supercomputers with ease of use , 1985, Parallel Comput..
[32] Christopher Batten,et al. Implementing the scale vector-thread processor , 2008, TODE.
[33] G.E. Moore,et al. Cramming More Components Onto Integrated Circuits , 1998, Proceedings of the IEEE.
[34] Krste Asanovic,et al. Compiling for vector-thread architectures , 2008, CGO '08.
[35] Sudhakar Yalamanchili,et al. Dynamic compilation of data-parallel kernels for vector processors , 2012, CGO '12.
[36] Mark Hampton,et al. Reducing exception management overhead with software restart markers , 2008 .
[37] Marc Tremblay,et al. Rock: A High-Performance Sparc CMT Processor , 2009, IEEE Micro.
[38] Onur Mutlu,et al. Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.
[39] David Patterson,et al. An Agile Approach to Building RISC-V Microprocessors , 2016, IEEE Micro.
[40] Krste Asanovic,et al. Convergence and scalarization for data-parallel architectures , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[41] Samuel Williams,et al. The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .
[42] David A. Patterson,et al. Computer Architecture: A Quantitative Approach , 1969 .
[43] Krste Asanović,et al. Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators , 2011 .
[44] Jarmo Takala,et al. pocl: A Performance-Portable OpenCL Implementation , 2014, International Journal of Parallel Programming.
[45] Christopher Torng,et al. Microarchitectural mechanisms to exploit value structure in SIMT architectures , 2013, ISCA.
[46] John Wawrzynek,et al. Chisel: Constructing hardware in a Scala embedded language , 2012, DAC Design Automation Conference 2012.
[47] William J. Dally,et al. A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors , 2012, TOCS.
[48] Alexandre E. Eichenberger,et al. Register allocation for predicated code , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.
[49] Uday Bondhugula,et al. A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.
[50] David Patterson,et al. The RISC-V Compressed Instruction Set Manual, Version 1.9 , 2015 .
[51] Christopher Batten,et al. Cache Refill/Access Decoupling for Vector Machines , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).
[52] Elad Alon,et al. A RISC-V vector processor with tightly-integrated switched-capacitor DC-DC converters in 28nm FDSOI , 2015, 2015 Symposium on VLSI Circuits (VLSI Circuits).
[53] R.H. Dennard,et al. Design Of Ion-implanted MOSFET's with Very Small Physical Dimensions , 1974, Proceedings of the IEEE.
[54] Vladimir Stojanovic,et al. Mixed Precision Vector Processors , 2015 .
[55] Ken Kennedy,et al. Practical dependence testing , 1991, PLDI '91.
[56] Vladimir M. Pentkovski,et al. Implementing Streaming SIMD Extensions on the Pentium III Processor , 2000, IEEE Micro.
[57] David A. Patterson,et al. Computer Architecture - A Quantitative Approach, 5th Edition , 1996 .
[58] Elad Alon,et al. A RISC-V Vector Processor With Simultaneous-Switching Switched-Capacitor DC–DC Converters in 28 nm FDSOI , 2016, IEEE Journal of Solid-State Circuits.
[59] Ronny Krashinsky. Vector-thread architecture and implementation , 2007 .
[60] Arthur Stoutchinin,et al. Efficient static single assignment form for predication , 2001, MICRO.
[61] Christopher Batten,et al. Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators , 2013, ACM Trans. Comput. Syst..
[62] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[63] M. Pharr,et al. ispc: A SPMD compiler for high-performance CPU programming , 2012, 2012 Innovative Parallel Computing (InPar).
[64] Ken Kennedy,et al. Automatic translation of FORTRAN programs to vector form , 1987, TOPL.
[65] Uri C. Weiser,et al. MMX technology extension to the Intel architecture , 1996, IEEE Micro.
[66] Sebastian Hack,et al. Improving Performance of OpenCL on CPUs , 2012, CC.
[67] Sudhakar Yalamanchili,et al. Characterization and transformation of unstructured control flow in bulk synchronous GPU applications , 2012, Int. J. High Perform. Comput. Appl..
[68] James E. Smith,et al. Vector instruction set support for conditional operations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[69] Sylvain Collange,et al. Identifying scalar behavior in CUDA kernels , 2011 .
[70] Dileep Bhandarkar,et al. VAX vector architecture , 1990, ISCA '90.
[71] Dirk Grunwald,et al. A system level perspective on branch architecture performance , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.
[72] James E. Smith,et al. Decoupled access/execute computer architectures , 1984, TOCS.
[73] Gregory T. Byrd,et al. Multithreaded processor architectures , 1995 .
[74] Youngmin Shin,et al. 20nm High-K metal gate heterogeneous 64-bit quad-core CPUs and hexa-core GPU for high-performance and energy-efficient mobile application processor , 2015, 2015 International SoC Design Conference (ISOCC).
[75] Steven S. Muchnick,et al. Advanced Compiler Design and Implementation , 1997 .
[76] Yunsup Lee,et al. A Case for MVPs : Mixed-Precision Vector Processors , 2014 .
[77] Michael Weiss. The transitive closure of control dependence: the iterated join , 1992, LOPL.
[78] Mateo Valero,et al. Decoupled vector architectures , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.
[79] Jaewook Shin. Introducing Control Flow into Vectorized Code , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).
[80] John H. Reif,et al. Efficient Symbolic Analysis of Programs , 1986, J. Comput. Syst. Sci..
[81] Roy Dz-Ching Ju,et al. Global predicate analysis and its application to register allocation , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.
[82] Edward T. Grochowski,et al. Larrabee: A many-Core x86 architecture for visual computing , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[83] Mike Murphy,et al. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs , 2010, CGO '10.
[84] Rajeev J. Ram,et al. Single-chip microprocessor that communicates directly using light , 2015, Nature.
[85] Scott A. Mahlke,et al. Effective compiler support for predicated execution using the hyperblock , 1992, MICRO 25.
[86] Tadashi Watanabe. Architecture and performance of NEC supercomputer SX system , 1987, Parallel Comput..
[87] Nam Sung Kim,et al. Power-efficient computing for compute-intensive GPGPU applications , 2012, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).
[88] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[89] Richard M. Russell,et al. The CRAY-1 computer system , 1978, CACM.
[90] Kevin Skadron,et al. Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[91] Krste Asanovic,et al. The RISC-V Instruction Set Manual Volume 2: Privileged Architecture Version 1.7 , 2015 .
[92] Elad Alon,et al. Raven: A 28nm RISC-V vector processor with integrated switched-capacitor DC-DC converters and adaptive clocking , 2015, 2015 IEEE Hot Chips 27 Symposium (HCS).
[93] Marc Tremblay,et al. VIS speeds new media processing , 1996, IEEE Micro.
[94] Mark N. Wegman,et al. Efficiently computing static single assignment form and the control dependence graph , 1991, TOPL.
[95] Olaf M. Lubeck,et al. The birth of the second generation: the Hitachi S-820/80 , 1988, Proceedings. SUPERCOMPUTING '88.
[96] Yunsup Lee,et al. The RISC-V Instruction Set Manual , 2014 .
[97] Sudhakar Yalamanchili,et al. Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[98] Joe D. Warren,et al. The program dependence graph and its use in optimization , 1987, TOPL.
[99] M. Schlansker,et al. On Predicated Execution , 1991 .
[100] Yunsup Lee,et al. A 45nm 1.3GHz 16.7 double-precision GFLOPS/W RISC-V processor with vector accelerators , 2014, ESSCIRC 2014 - 40th European Solid State Circuits Conference (ESSCIRC).
[101] Krste Asanovic,et al. Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[102] Andrew Waterman,et al. Design of the RISC-V Instruction Set Architecture , 2016 .
[103] Andrew Waterman,et al. The RISC-V Instruction Set Manual. Volume 1: User-Level ISA, Version 2.0 , 2014 .
[104] G.E. Moore,et al. No exponential is forever: but "Forever" can be delayed! [semiconductor industry] , 2003, 2003 IEEE International Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC.
[105] Tor M. Aamodt,et al. Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware , 2009, TACO.
[106] Yunsup Lee. Efficient VLSI Implementations of Vector-Thread Architectures , 2011 .
[107] Arthur B. Maccabe,et al. The program dependence web: a representation supporting control-, data-, and demand-driven interpretation of imperative languages , 1990, PLDI '90.
[108] Sebastian Hack,et al. Whole-function vectorization , 2011, International Symposium on Code Generation and Optimization (CGO 2011).