Paper Information - Data Management: The Spirit to Pursuit Peak Performance on Many-Core Processor

To date, most many-core prototypes employ tiled topologies connected through on-chip networks. The throughput and latency of the on-chip network usually become the bottleneck to achieving peak performance, especially for communication-intensive applications. Most studies focus on the on-chip network alone, for example on routing algorithms or router micro-architecture, to improve these metrics. The salient aspect of our approach is that we provide a data management framework that achieves highly efficient on-chip traffic by considering the many-core system as a whole. The major contributions of this paper are: (1) a novel tiled many-core architecture that supports software-controlled on-chip data storage and movement management; (2) the identification of asynchronous bulk data transfer as an effective mechanism for tolerating the latency of 2-D mesh on-chip networks. Finally, we evaluate the 1-D FFT algorithm on the framework; it achieves 47.6 Gflops at 24.8% computation efficiency.
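The latency-tolerance argument in the abstract can be illustrated with a simple timing model. The sketch below is not taken from the paper: the function names, the per-block costs, and the assumption of ideal double buffering (the transfer of the next block fully overlaps the computation on the current one) are all hypothetical. The last function only rearranges the two figures the abstract reports, on the assumption that "computation efficiency" means achieved throughput divided by peak throughput.

```python
def synchronous_time(n_blocks, t_transfer, t_compute):
    """Total time when each bulk transfer must finish before computing starts."""
    return n_blocks * (t_transfer + t_compute)


def overlapped_time(n_blocks, t_transfer, t_compute):
    """Total time with asynchronous transfers and ideal double buffering.

    Assumption (not from the paper): while block i is being processed,
    block i+1 is fetched in the background, so each steady-state step
    costs max(t_transfer, t_compute). Only the first transfer and the
    last computation are fully exposed.
    """
    if n_blocks == 0:
        return 0.0
    steady_state = (n_blocks - 1) * max(t_transfer, t_compute)
    return t_transfer + steady_state + t_compute


def implied_peak_gflops(achieved_gflops, efficiency):
    """Peak throughput implied by an achieved rate and an efficiency ratio."""
    return achieved_gflops / efficiency


if __name__ == "__main__":
    # Hypothetical costs: 8 blocks, 3 time units to move, 5 to compute.
    print(synchronous_time(8, 3.0, 5.0))
    print(overlapped_time(8, 3.0, 5.0))
    # Figures reported in the abstract: 47.6 Gflops at 24.8% efficiency.
    print(round(implied_peak_gflops(47.6, 0.248), 1))
```

Under these assumptions the transfer cost disappears from the steady state whenever computation on a block takes at least as long as moving it, which is the sense in which bulk asynchronous transfers hide mesh latency. On the same reading of efficiency, the reported figures would correspond to a peak of roughly 192 Gflops.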
