Efficient data streaming with on-chip accelerators: Opportunities and challenges

The transistor density of microprocessors continues to increase as technology scales. Microprocessor designers have taken advantage of the additional transistors by integrating a significant number of cores onto a single die. However, adding more cores yields diminishing returns because of software and hardware scalability issues, so designers have started integrating on-chip special-purpose logic units (i.e., accelerators) that were previously available as PCI-attached units. More accelerators are expected to be integrated on-chip, both because transistors are increasingly abundant and because power budget limits prevent all logic from being powered at all times. On-chip accelerator architectures therefore deserve more attention from the research community, and there is a wide spectrum of research opportunities in accelerator design and optimization. This paper studies the data access streams of on-chip accelerators to draw out insights, in the hope of fostering future research in this area. Specifically, it uses a few simple case studies to show some common characteristics of the data streams introduced by on-chip accelerators, discusses the challenges and opportunities in exploiting these characteristics to optimize accelerator power and performance, and then analyzes the effectiveness of some simple proposed optimizing extensions.
