An efficient dataflow accelerator for scientific applications

Abstract Dataflow architectures have proven promising for high-performance computing. However, traditional dataflow architectures are inefficient on typical scientific applications such as stencil and FFT because their function units are poorly utilized. Based on the blocking and parallelism features of scientific applications, we design SPU, an efficient dataflow architecture for such workloads. In SPU, dataflow graphs translated from the loop bodies of scientific applications are mapped to the processing element (PE) array. During execution, iterations enter the dataflow graph in a pipelined fashion, and three levels of parallelism are exploited to improve function-unit utilization: inner-graph parallelism, pipelining parallelism, and inter-graph parallelism. Experimental results show that SPU achieves an average energy efficiency of 25.97 GFLOPS/W in 40 nm technology, and that its floating-point function-unit utilization is on average 2.82x that of a typical dataflow architecture for typical scientific applications.
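To illustrate why pipelining iterations through a mapped dataflow graph can raise function-unit utilization, the following is a minimal toy model, not SPU's actual mechanism. It assumes one operation per PE, unit latency for every operation, and no communication or memory stalls. The graph `STENCIL` and the functions `graph_depth` and `utilization` are hypothetical names introduced only for this sketch.

```python
from functools import lru_cache

# Hypothetical loop body: out = c0*a + c1*b + c2*c (three multiplies, two adds).
# Each key is an operation; its value lists the operations it depends on.
STENCIL = {
    "m0": [], "m1": [], "m2": [],
    "a0": ["m0", "m1"],
    "a1": ["a0", "m2"],
}

def graph_depth(preds):
    """Length of the longest dependence chain, in unit-latency operations."""
    @lru_cache(maxsize=None)
    def level(node):
        return 1 + max((level(p) for p in preds[node]), default=0)
    return max(level(n) for n in preds)

def utilization(preds, iterations, pipelined):
    """Fraction of PE-cycles doing useful work (assumes one op per PE)."""
    depth = graph_depth(preds)
    n_ops = len(preds)
    if pipelined:
        # A new iteration enters every cycle once the pipeline is running.
        cycles = depth + iterations - 1
    else:
        # Each iteration must drain completely before the next one starts.
        cycles = depth * iterations
    busy_pe_cycles = n_ops * iterations
    return busy_pe_cycles / (n_ops * cycles)

if __name__ == "__main__":
    print(utilization(STENCIL, 100, pipelined=False))
    print(utilization(STENCIL, 100, pipelined=True))
```

In this model, utilization without pipelining is bounded by the reciprocal of the graph depth, while the pipelined case approaches full utilization as the iteration count grows. A real design also has to account for operation latencies, routing, and memory bandwidth, which this sketch ignores.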
