Vector Microprocessors for Desktop Computing

Rapid advances in IC technology, combined with recent trends in design complexity and application usage, indicate that vector microprocessors can provide a scalable, cost-effective solution for desktop computers in the near future. In this paper, we review these trends and explain how they benefit vector processors. A decentralized design coupled with a simpler in-order issue mechanism will allow vector microprocessors to achieve the higher clock frequencies enabled by future IC technology. Moreover, the simpler vector hardware offers a favorable alternative to the increasing design sophistication of out-of-order (OOO) superscalar processors. From the application side, it is expected that, by the year 2000, 90% of desktop cycles will be dedicated to multimedia workloads. For these applications, vector microprocessors can supply up to twice the performance of today's OOO superscalar processors, which are already aided by short-vector multimedia instructions. One potential drawback of a vector-based desktop computer is the general belief that vector processors can improve the performance of only vectorizable programs. In this paper, we use two approaches to address the problem of non-vectorizable performance. First, we quantify the performance loss that might result and determine that the loss is not significant. This is because non-vectorizable applications tend to have only modest amounts of ILP. A 2-way in-order vector microprocessor should require no more than twice as many clock periods as a wider 4-way out-of-order processor to execute such applications. Some if not most of this cycle-count difference should be made up by the higher clock rate made possible by the vector microprocessor's decentralized design, or by identifying vectorizable loops. Second, we investigate the use of outer-loop parallelism to expand the number of loops that can be vectorized effectively.
We show that directly vectorizing outer loops offers the greatest promise among compilation techniques that exploit outer-loop parallelism. The vector microprocessor can use outer-loop vectorization to execute select benchmarks 4 to 9 times faster than the OOO superscalar using loop interchange.
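
To make the distinction between loop interchange and direct outer-loop vectorization concrete, the following is a minimal sketch in Python and is not taken from the paper. The loop nest (a running sum along each row), the function names, and the vector length `vl` are all hypothetical, and the input is assumed to be a rectangular list of lists. The inner loop carries a dependence and so cannot be vectorized as written, while the outer-loop iterations are independent.

```python
# Hypothetical loop nest: running sum along each row of a rectangular matrix.
# The inner (j) loop has a recurrence; the outer (i) iterations are independent.

def prefix_rows_scalar(a):
    """Original nest: outer loop over rows, inner loop with a recurrence."""
    out = [row[:] for row in a]
    for i in range(len(out)):
        for j in range(1, len(out[i])):
            out[i][j] = out[i][j - 1] + a[i][j]
    return out


def prefix_rows_interchanged(a):
    """Loop interchange: the independent i loop is moved innermost,
    where a conventional inner-loop vectorizer could handle it."""
    out = [row[:] for row in a]
    n_rows = len(out)
    n_cols = len(out[0]) if out else 0
    for j in range(1, n_cols):
        for i in range(n_rows):  # independent iterations
            out[i][j] = out[i][j - 1] + a[i][j]
    return out


def prefix_rows_outer_vectorized(a, vl=4):
    """Outer-loop vectorization: the i loop is strip-mined by vl and the
    j loop keeps its original position. Each j step models one vector
    operation across a strip of up to vl rows. vl is an assumed value."""
    out = [row[:] for row in a]
    n_rows = len(out)
    n_cols = len(out[0]) if out else 0
    for i0 in range(0, n_rows, vl):
        strip = range(i0, min(i0 + vl, n_rows))
        for j in range(1, n_cols):
            prev = [out[i][j - 1] for i in strip]      # models a strided vector load
            cur = [a[i][j] for i in strip]             # models a strided vector load
            result = [p + c for p, c in zip(prev, cur)]  # models a vector add
            for i, r in zip(strip, result):            # models a strided vector store
                out[i][j] = r
    return out
```

All three functions compute the same result; they differ only in which loop supplies the independent iterations and in the order in which memory is traversed. The list comprehensions stand in for vector instructions and say nothing about the relative performance reported in the abstract.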
