Paper information - Landing stencil code on Godson-T


The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers have a unique opportunity to introduce new architecture features together with adequate compiler technology; combined, the two may have a profound impact. This paper presents a case study (using the 1-D Jacobi computation) of compiler-amenable performance optimization techniques on the many-core architecture Godson-T. Two features of the Godson-T architecture are chosen for this study: 1) chip-level globally addressable memory, in particular the scratchpad memories (SPM) local to the processing cores; 2) fine-grain memory-based synchronization (e.g., full-empty bits). Leveraging state-of-the-art performance optimization methods for 1-D stencil parallelization (e.g., timed tiling and variants), we developed and implemented a number of many-core-based optimizations for Godson-T. Our experimental study shows good performance in both execution-time speedup and scalability, validates the value of globally accessible SPM and the fine-grain synchronization mechanism (full-empty bits) on Godson-T, and provides some useful guidelines for future compiler technology for many-core chip architectures.
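For readers unfamiliar with the kernel, the following is a minimal sketch of a 1-D Jacobi computation and of one simple tiling variant over the time dimension. It is not the paper's implementation: the 3-point averaging stencil, the fixed end points, the function names, and the overlapped (halo-based) tiling scheme are all assumptions chosen for illustration. The per-tile `local` list only stands in for a scratchpad copy, and full-empty-bit synchronization is not modeled at all.

```python
def jacobi_naive(a, steps):
    """Reference 1-D Jacobi: 3-point average, end points held fixed.

    The stencil and boundary rule are assumptions for this sketch.
    """
    cur = list(a)
    for _ in range(steps):
        nxt = cur[:]
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


def jacobi_overlapped_tiles(a, steps, tile):
    """Same result, computed one tile at a time for all `steps` time steps.

    Each tile copies its cells plus a halo of `steps` cells per side into
    a local buffer (a hypothetical stand-in for a scratchpad), runs every
    time step locally, and writes back only its own cells. Halo cells are
    recomputed redundantly by neighbouring tiles.
    """
    n = len(a)
    out = list(a)
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo = max(start - steps, 0)   # halo clipped at the array edges
        hi = min(end + steps, n)
        local = list(a[lo:hi])
        for _ in range(steps):
            nxt = local[:]
            for j in range(1, len(local) - 1):
                nxt[j] = (local[j - 1] + local[j] + local[j + 1]) / 3.0
            local = nxt
        # After `steps` updates only the halo cells are stale, so the
        # tile's own cells can be written back safely.
        out[start:end] = local[start - lo:end - lo]
    return out
```

Because every tile reads only the original array and writes only its own cells, the tiles in this sketch are independent of one another. The cost is redundant work in the halos, which is the kind of trade-off that tiling variants with explicit synchronization are meant to avoid.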

Dongrui Fan | Lei Wang | Xiaobing Feng | Huimin Cui
