Data Management: The Spirit to Pursuit Peak Performance on Many-Core Processor

to date, most of many-core prototypes employ tiled topologies connected through on-chip networks. The throughput and latency of the on-chip networks usually become to the bottleneck to achieve peak performance especially for communication intensive applications. Most of studies are focus on on-chip networks only, such as routing algorithms or router micro-architecture, to improve the above metrics. The salient aspect of our approach is that we provide a data management framework to implement high efficient on-chip traffic based on overall many-core system. The major contributions of this paper include that: (1) providing a novel tiled many-core architecture which supports software controlled on-chip data storage and movement management; (2) identifying that the asynchronous bulk data transfer mechanism is an effective method to tolerant the latency of 2-D mesh on-chip networks. At last, we evaluate the 1-D FFT algorithm on the framework and the performance achieves 47.6 Gflops with 24.8% computation efficiency.

[1]  Henry Hoffmann,et al.  On-Chip Interconnection Architecture of the Tile Processor , 2007, IEEE Micro.

[2]  Akif Ali,et al.  Near-optimal worst-case throughput routing for two-dimensional mesh networks , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[3]  Saurabh Dighe,et al.  An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[4]  William J. Dally,et al.  Route packets, not wires: on-chip inteconnection networks , 2001, DAC '01.

[5]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[6]  H. Peter Hofstee,et al.  Introduction to the Cell multiprocessor , 2005, IBM J. Res. Dev..

[7]  Guang R. Gao,et al.  Optimizing the Fast Fourier Transform on a Multi-core Architecture , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[8]  Per Stenström,et al.  An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[9]  M. Suzuoki,et al.  Overview of the architecture, circuit design, and physical implementation of a first-generation cell processor , 2006, IEEE Journal of Solid-State Circuits.

[10]  Sangyeun Cho,et al.  Managing Distributed, Shared L2 Caches through OS-Level Page Allocation , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[11]  Timothy Mark Pinkston,et al.  Characterizing the Cell EIB On-Chip Network , 2007, IEEE Micro.

[12]  Liviu Iftode,et al.  Scope Consistency: A Bridge between Release Consistency and Entry Consistency , 1996, SPAA '96.

[13]  Timothy Mark Pinkston,et al.  On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus , 2007, First International Symposium on Networks-on-Chip (NOCS'07).

[14]  Samuel Williams,et al.  Scientific Computing Kernels on the Cell Processor , 2007, International Journal of Parallel Programming.

[15]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[16]  W. Dally Interconnect-limited VLSI architecture , 1999, Proceedings of the IEEE 1999 International Interconnect Technology Conference (Cat. No.99EX247).

[17]  Naga K. Govindaraju,et al.  High performance discrete Fourier transforms on graphics processors , 2008, HiPC 2008.

[18]  Axel Jantsch,et al.  Network on Chip : An architecture for billion transistor era , 2000 .

[19]  Guang R. Gao,et al.  Toward a Software Infrastructure for the Cyclops-64 Cellular Architecture , 2006, 20th International Symposium on High-Performance Computing in an Advanced Collaborative Environment (HPCS'06).

[20]  William J. Dally,et al.  Flattened butterfly: a cost-efficient topology for high-radix networks , 2007, ISCA '07.

[21]  M. Puschel,et al.  FFT Program Generation for Shared Memory: SMP and Multicore , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[22]  S. Lennart Johnsson,et al.  Scheduling FFT computation on SMP and multicore systems , 2007, ICS '07.

[23]  David H. Bailey,et al.  FFTs in external or hierarchical memory , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).