Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators

We present a taxonomy and modular implementation approach for data-parallel accelerators, including the MIMD, vector-SIMD, subword-SIMD, SIMT, and vector-thread (VT) architectural design patterns. We introduce Maven, a new VT microarchitecture that is based on the traditional vector-SIMD microarchitecture and is considerably simpler to implement and easier to program than previous VT designs. Using an extensive design-space exploration of full VLSI implementations of many accelerator design points, we evaluate the varying tradeoffs between programmability and implementation efficiency among the MIMD, vector-SIMD, and VT patterns on a workload of compiled microbenchmarks and application kernels. We find that the vector cores provide greater efficiency than the MIMD cores, even on fairly irregular kernels. Our results suggest that the Maven VT microarchitecture is superior to the traditional vector-SIMD architecture, providing both greater efficiency and easier programmability.
