An instruction-systolic programmable shader architecture for multi-threaded 3D graphics processing

To meet both the performance and programmability demands of 3D graphics applications, recent graphics processing units employ vector and multithreaded SIMD architectures. This paper introduces a novel instruction-systolic array architecture that passes an instruction stream through the array in a pipelined fashion so that the expensive functional resources of a graphics processor are shared efficiently. In such parallel architectures, cache misses and dynamic branches can introduce additional latency and complicate management. To address this problem, we combine a systolic execution scheme with on-demand warp activation, which handles cache-miss latency and branch divergence efficiently without significantly increasing hardware resources in terms of either logic or register space. Simulation indicates that the proposed architecture offers 25% better performance than a traditional SIMD architecture with the same resources, and requires significantly fewer resources to match the performance of a typical modern vector multithreaded GPU architecture.
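The pipelined transfer of an instruction stream can be illustrated with a small timing model. The sketch below is not the paper's design: it assumes a one-dimensional array in which each instruction advances by one processing element (PE) per cycle, and the names `systolic_schedule` and `max_simultaneous_users` are hypothetical. Under that assumption, no two PEs execute the same instruction in the same cycle, which is one way a functional unit needed by a particular instruction could be shared across PEs.

```python
def systolic_schedule(num_instructions, num_pes):
    """Return one row per cycle; row[pe] is the index of the instruction
    executing at that PE in that cycle, or None if the PE is idle.

    Assumption (illustrative only): instruction k enters PE 0 at cycle k
    and moves to the next PE every cycle, so it reaches PE p at cycle k + p.
    """
    total_cycles = num_instructions + num_pes - 1  # pipeline fill + drain
    schedule = []
    for cycle in range(total_cycles):
        row = []
        for pe in range(num_pes):
            instr = cycle - pe
            row.append(instr if 0 <= instr < num_instructions else None)
        schedule.append(row)
    return schedule


def max_simultaneous_users(schedule, instr_index):
    """Largest number of PEs executing the given instruction in any one cycle."""
    return max(row.count(instr_index) for row in schedule)


if __name__ == "__main__":
    # Print which instruction each of 4 PEs runs in each cycle.
    for cycle, row in enumerate(systolic_schedule(num_instructions=5, num_pes=4)):
        print(cycle, row)
```

In this model a lock-step SIMD array would have every PE on the same instruction in every cycle, whereas the skewed schedule spreads each instruction's use of a functional unit over consecutive cycles at the cost of `num_pes - 1` extra cycles of fill and drain.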
