Landing stencil code on Godson-T

The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers have a unique opportunity to introduce new architecture features together with adequate compiler technology; combined, the two may have a profound impact. This paper presents a case study (using the 1-D Jacobi computation) of compiler-amenable performance optimization techniques on a many-core architecture, Godson-T. Two distinctive features of the Godson-T architecture were chosen for this study: 1) chip-level globally addressable memory, in particular the scratchpad memories (SPMs) local to the processing cores; 2) fine-grain memory-based synchronization (e.g., full-empty bits). Leveraging state-of-the-art performance optimization methods for 1-D stencil parallelization (e.g., timed tiling and variants), we developed and implemented a number of many-core-based optimizations for Godson-T. Our experimental study shows good performance in both execution time speedup and scalability, validates the value of the globally accessible SPM and the fine-grain synchronization mechanism (full-empty bits) on Godson-T, and provides some useful guidelines for future compiler technology for many-core chip architectures.
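For readers unfamiliar with the kernel, the following is a minimal sequential sketch of a 1-D Jacobi computation. The abstract does not give the exact kernel, so the 3-point average, the fixed boundary values, and the names `jacobi_1d_step` and `jacobi_1d` are assumptions made for illustration; nothing here is Godson-T specific.

```c
#include <assert.h>
#include <stddef.h>

/* One Jacobi sweep (assumed 3-point form): every interior point becomes
 * the average of itself and its two neighbors, read from `in` and
 * written to `out`. The two boundary points are copied unchanged. */
static void jacobi_1d_step(const double *in, double *out, size_t n)
{
    assert(n >= 2);
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    for (size_t i = 1; i + 1 < n; ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
}

/* Run `steps` sweeps, alternating between buffers `a` and `b`.
 * Returns a pointer to whichever buffer holds the final values. */
static double *jacobi_1d(double *a, double *b, size_t n, int steps)
{
    double *src = a;
    double *dst = b;
    for (int t = 0; t < steps; ++t) {
        jacobi_1d_step(src, dst, n);
        double *tmp = src;  /* swap roles for the next time step */
        src = dst;
        dst = tmp;
    }
    return src;
}
```

Each point at time step t + 1 depends on three neighboring points at time step t. Tiling across time steps and synchronizing only between neighboring tiles, rather than with a global barrier after every sweep, both exploit this dependence pattern.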
