Low-complexity vector microprocessor extension

Over the last few years, single-thread performance has improved at a snail’s pace. Power limitations, increasing relative memory latency, and the exhaustion of gains from instruction-level parallelism are forcing microprocessor architects to examine new processor design strategies. In this dissertation, I examine a technology that can improve the efficiency of modern microprocessors: vectors. Vectors are a simple, power-efficient way to exploit common data-level parallelism in an extensible, easily programmable manner. My work focuses on the transition from traditional scalar microprocessors to computers that can take advantage of vectors. First, I describe a process for extending existing single-instruction, multiple-data (SIMD) instruction sets to support full vector processing in a way that remains binary compatible with existing applications. Initial implementations can be low cost, yet can be transparently extended to higher performance later. Second, I describe ViVA, the Virtual Vector Architecture. ViVA adds vector-style memory operations to existing microprocessors but does not include arithmetic datapaths; instead, memory instructions work with a new buffer placed between the core and second-level cache. ViVA is a low-cost way to obtain much of the performance of a full vector memory hierarchy while avoiding the complexity of adding a full vector system. Finally, I evaluate ViVA by modifying a cycle-accurate full-system simulator to support its operation. After extensive calibration, I measure ViVA’s basic performance using a series of microbenchmarks. I then compare a variety of ViVA configurations on corner turn, used in processing multidimensional data, and sparse matrix-vector multiplication, used in many scientific applications. Results show that ViVA can provide significant benefit for a variety of memory access patterns without relying on a costly hardware prefetcher.
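For readers unfamiliar with the two kernels named above, the following is a minimal scalar sketch of the memory access patterns they involve. It is not ViVA code and does not model the buffer or any vector instruction; the function names `corner_turn` and `spmv_csr`, and the choice of compressed sparse row (CSR) storage, are assumptions made for illustration only.

```python
def corner_turn(matrix):
    """Transpose a dense matrix given as a list of equal-length rows."""
    rows, cols = len(matrix), len(matrix[0])
    # Walking down a column of a row-major array is a strided access:
    # in a flat layout, consecutive reads would be `cols` elements apart.
    return [[matrix[r][c] for r in range(rows)] for c in range(cols)]


def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A * x for a sparse matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x[col_idx[k]] is an indexed (gather) access: the address
            # depends on data loaded from col_idx, not on the loop index.
            y[row] += values[k] * x[col_idx[k]]
    return y


if __name__ == "__main__":
    print(corner_turn([[1, 2, 3], [4, 5, 6]]))
    # A = [[2, 0, 1], [0, 3, 0]] in CSR form
    print(spmv_csr([0, 2, 3], [0, 2, 1], [2.0, 1.0, 3.0], [1.0, 2.0, 3.0]))
```

Corner turn stresses strided access and sparse matrix-vector multiplication stresses indexed access, which is why the two together cover a useful range of memory behavior.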
