Programming the Adapteva Epiphany 64-core network-on-chip coprocessor

Energy efficiency is the primary impediment on the path to exascale computing. Consequently, the high-performance computing community is increasingly interested in low-power, high-performance embedded systems as building blocks for large-scale high-performance systems. The Adapteva Epiphany architecture integrates low-power RISC cores on a 2D mesh network and promises up to 70 GFLOPS/Watt of theoretical peak performance. However, with just 32 KB of local memory per eCore for storing both code and data, programming the Epiphany system presents significant challenges. In this paper we evaluate the performance of a 64-core Epiphany system with a variety of basic compute and communication micro-benchmarks. Further, we implemented two well-known application kernels: a 5-point star-shaped heat stencil with a peak performance of 65.2 GFLOPS, and matrix multiplication with 65.3 GFLOPS in single precision, across the 64 Epiphany cores. We discuss strategies for implementing high-performance computing application kernels on such memory-constrained, low-power devices and compare the Epiphany with competing low-power systems. With future Epiphany revisions expected to house thousands of cores on a single chip, understanding the merits of such an architecture is of prime importance to the exascale initiative.
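For context, the 5-point star-shaped heat stencil mentioned above updates each interior grid point from its own value and its four nearest neighbours. The C sketch below is a minimal serial, single-precision reference version of that update, not the authors' Epiphany implementation; the grid size N, the coefficient ALPHA, and the function name stencil_sweep are illustrative assumptions introduced here.

    #include <stdio.h>

    #define N     64      /* illustrative grid dimension, not taken from the paper */
    #define ALPHA 0.1f    /* diffusion coefficient; must be <= 0.25 for stability  */

    /* One explicit sweep of the 5-point heat stencil: each interior point is
     * updated from itself and its north, south, east and west neighbours. */
    static void stencil_sweep(const float in[N][N], float out[N][N])
    {
        for (int i = 1; i < N - 1; i++) {
            for (int j = 1; j < N - 1; j++) {
                out[i][j] = in[i][j] + ALPHA * (in[i - 1][j] + in[i + 1][j] +
                                                in[i][j - 1] + in[i][j + 1] -
                                                4.0f * in[i][j]);
            }
        }
    }

    int main(void)
    {
        static float a[N][N], b[N][N];     /* static arrays are zero-initialised */
        a[N / 2][N / 2] = 100.0f;          /* a single hot point as the initial condition */

        for (int step = 0; step < 100; step++) {
            stencil_sweep(a, b);           /* ping-pong between the two buffers */
            stencil_sweep(b, a);
        }
        printf("centre value after 200 sweeps: %f\n", a[N / 2][N / 2]);
        return 0;
    }

In a distributed version of the kind described in the paper, each core would typically hold a small tile of the grid in its 32 KB local memory and exchange boundary (halo) rows and columns with neighbouring cores over the mesh network before each sweep.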
