Universal mechanisms for data-parallel architectures

Data-parallel programs are both growing in importance and increasing in diversity, resulting in specialized processors targeted at specific classes of these programs. This paper presents a classification scheme for data-parallel program attributes, and proposes micro-architectural mechanisms to support applications with diverse behavior for using a single reconfigurable architecture. We focus on the following four broad kinds of data-parallel programs - DSP/multimedia, scientific, networking, and real-time graphics workloads. While all of these programs exhibit high computational intensity, coarse-grain regular control behavior, and some regular memory access behavior, they show wide variance in the computation requirements, fine grain control behavior, and the frequency of other types of memory accesses. Based on this study of application attributes, this paper proposes a set of general micro-architectural mechanisms that enable a baseline architecture to be dynamically tailored to the demands of a particular application. These mechanisms provide efficient execution across a spectrum of data-parallel application and can be applied to diverse architectures ranging from vector cores to conventional superscalar cores. Our results using a baseline TRIPS processor show that the configurability of the architecture to the application demands provides harmonic mean performance improvement of 5%-55% over scalable yet less flexible architectures, and performs competitively against specialized architectures.

[1]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[2]  Richard M. Russell,et al.  The CRAY-1 computer system , 1978, CACM.

[3]  Brian B. Moore,et al.  Concepts of the System/370 vector architecture , 1987, ISCA '87.

[4]  Shekhar Y. Borkar,et al.  iWarp: an integrated solution to high-speed parallel computing , 1988, Proceedings. SUPERCOMPUTING '88.

[5]  Tom Blank,et al.  The MasPar MP-1 architecture , 1990, Digest of Papers Compcon Spring '90. Thirty-Fifth IEEE Computer Society International Conference on Intellectual Leverage.

[6]  Kurt Akeley,et al.  Reality Engine graphics , 1993, SIGGRAPH.

[7]  V. Michael Bove,et al.  Cheops: a reconfigurable data-flow system for video processing , 1995, IEEE Trans. Circuits Syst. Video Technol..

[8]  David A. Patterson,et al.  Computer architecture (2nd ed.): a quantitative approach , 1996 .

[9]  Paolo Faraboschi,et al.  Custom-fit processors: letting applications define architectures , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[10]  Mateo Valero,et al.  Out-of-order vector architectures , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[11]  Anoop Gupta,et al.  The Design and Analysis of a Cache Architecture for Texture Mapping , 1997, ISCA.

[12]  William J. Dally,et al.  A bandwidth-efficient architecture for media processing , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[13]  James E. Smith,et al.  Vector instruction set support for conditional operations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[14]  Masaaki Oka,et al.  Vector Unit Architecture for Emotion Synthesis , 2000, IEEE Micro.

[15]  William J. Dally,et al.  Smart Memories: a modular reconfigurable architecture , 2000, ISCA '00.

[16]  W. Dally,et al.  Efficient conditional operations for data-parallel architectures , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.

[17]  W. Dally,et al.  Stream Scheduling , 2001 .

[18]  R. Nagarajan,et al.  A design space evaluation of grid processor architectures , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[19]  Todd M. Austin,et al.  CryptoManiac: a fast flexible architecture for secure communication , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.

[20]  Henry Hoffmann,et al.  The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs , 2002, IEEE Micro.

[21]  Balaram Sinharoy,et al.  POWER4 system microarchitecture , 2002, IBM J. Res. Dev..

[22]  Matthew Mattina,et al.  Tarantula: a vector extension to the alpha architecture , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[23]  Joseph R. Cavallaro,et al.  A programmable baseband processor design for software defined radios , 2002, The 2002 45th Midwest Symposium on Circuits and Systems, 2002. MWSCAS-2002..

[24]  Tim Olson Advanced Processing Techniques Using the Intrinsity TM FastMATH TM Processor , 2002 .

[25]  Jaehyuk Huh,et al.  Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture , 2003, ISCA '03.

[26]  Pat Hanrahan,et al.  Data Parallel Computation on Graphics Hardware , 2003 .

[27]  William R. Mark,et al.  Cg: a system for programming graphics hardware in a C-like language , 2003, ACM Trans. Graph..

[28]  R. Fernando,et al.  The Cg tutorial日本語版 : プログラム可能なリアルタイムグラフィックス完全ガイド , 2003 .

[29]  Christoforos E. Kozyrakis,et al.  Overcoming the limitations of conventional vector processors , 2003, ISCA '03.