MPI communication on MPPA Many-core NoC: design, modeling and performance issues

Power dissipation and energy consumption have become major issues for high-performance computing and embedded systems. The performance trend of recent decades can no longer be sustained by raising processor clock speeds. The usual strategy today is to lower the frequency and increase the number of cores. On such systems, data communication and memory bandwidth can become the main bottleneck, since there are more and more processing units to coordinate. In this paper, we introduce an MPI design and its implementation on the MPPA-256 (Multi Purpose Processor Array) processor from Kalray Inc., one of the first companies worldwide in the many-core architecture field. A model was developed to evaluate communication performance and bottlenecks on the MPPA. Our achieved result of 1.2 GB/s, i.e., 75% of peak throughput, for on-chip communication shows that the MPPA is a promising architecture for next-generation HPC systems, with its high performance-to-power ratio and high-bandwidth network-on-chip. However, the lack of a globally addressable memory on this distributed-memory architecture still requires the developer to take care of cache coherence and to pay attention to the limited local memory space of each compute element.

Keywords: Many-core, NUMA, Distributed memory, Network-on-Chip, MPI, Performance modeling, Linpack, HPL, MPPA.
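
As a quick check on the throughput figure quoted in the abstract, the sketch below derives the peak on-chip throughput implied by the two stated numbers (1.2 GB/s achieved, 75% of peak), which comes to roughly 1.6 GB/s. The function name `implied_peak` is hypothetical and introduced only for illustration; the calculation is plain arithmetic on the abstract's figures, not the performance model developed in the paper.

```python
def implied_peak(achieved_gb_s: float, fraction_of_peak: float) -> float:
    """Return the peak throughput implied by an achieved rate and its
    fraction of peak (hypothetical helper, not from the paper)."""
    if not 0.0 < fraction_of_peak <= 1.0:
        raise ValueError("fraction_of_peak must be in (0, 1]")
    return achieved_gb_s / fraction_of_peak


# Figures taken from the abstract.
achieved = 1.2   # GB/s, measured on-chip MPI throughput
fraction = 0.75  # stated share of peak throughput

peak = implied_peak(achieved, fraction)
print(f"Implied peak throughput: {peak:.2f} GB/s")
print(f"Unused headroom: {peak - achieved:.2f} GB/s")
```

The same helper can be reused to sanity-check any other achieved-versus-peak pair reported for a network-on-chip link.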
