Paper information - Low-complexity vector microprocessor extension

Low-complexity vector microprocessor extension

For the last few years, single-thread performance has been improving at a snail’s pace. Power limitations, increasing relative memory latency, and the exhaustion of improvement in instruction-level parallelism are forcing microprocessor architects to examine new processor design strategies. In this dissertation, I take a look at a technology that can improve the efficiency of modern microprocessors: vectors. Vectors are a simple, power-efficient way to take advantage of common data-level parallelism in an extensible, easily programmable manner. My work focuses on the process of transitioning from traditional scalar microprocessors to computers that can take advantage of vectors. First, I describe a process for extending existing single-instruction, multiple-data instruction sets to support full vector processing in a way that remains binary compatible with existing applications. Initial implementations can be low cost and can later be transparently extended to higher performance. I also describe ViVA, the Virtual Vector Architecture. ViVA adds vector-style memory operations to existing microprocessors but does not include arithmetic datapaths; instead, memory instructions work with a new buffer placed between the core and second-level cache. ViVA serves as a low-cost way to get much of the performance of full vector memory hierarchies while avoiding the complexity of adding a full vector system. Finally, I test the performance of ViVA by modifying a cycle-accurate full-system simulator to support ViVA’s operation. After extensive calibration, I test the basic performance of ViVA using a series of microbenchmarks. I compare the performance of a variety of ViVA configurations for corner turn, used in processing multidimensional data, and sparse matrix-vector multiplication, used in many scientific applications. Results show that ViVA can give significant benefit for a variety of memory access patterns, without relying on a costly hardware prefetcher.
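The two kernels named in the abstract stress memory in different ways, which a minimal scalar sketch may help make concrete. The code below is illustrative only and is not taken from the dissertation: the function names are hypothetical, Python stands in for whatever language the benchmarks were written in, and the compressed sparse row (CSR) layout is an assumption, since the abstract does not say which sparse format was used. Corner turn is shown as a plain matrix transpose, which produces strided accesses; sparse matrix-vector multiplication produces indexed (gather) accesses.

```python
def corner_turn(matrix):
    """Transpose a matrix given as a list of equal-length rows.

    In a row-major layout, walking down one column touches elements
    that are `cols` entries apart, i.e. a strided access pattern.
    """
    rows, cols = len(matrix), len(matrix[0])
    return [[matrix[i][j] for i in range(rows)] for j in range(cols)]


def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A * x for a sparse matrix A stored in CSR form.

    values  : nonzero entries of A, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row r occupies values[row_ptr[r]:row_ptr[r + 1]]
    (CSR is an assumed format, chosen here for illustration.)
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            # x[col_idx[k]] is an indexed (gather) load: its address
            # depends on data read from col_idx, not on a fixed stride.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

For example, the 2x3 matrix [[1, 0, 2], [0, 3, 0]] would be passed to `spmv_csr` as `values=[1, 2, 3]`, `col_idx=[0, 2, 1]`, `row_ptr=[0, 2, 3]`.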

David A. Patterson | Joseph James Gebis
