Near-Memory Acceleration for Radio Astronomy

Processing-in-memory and near-memory computing have recently been rediscovered as a way to alleviate the “memory wall problem” of traditional computing architectures. In this paper, we discuss the implementation of a 3D-stacked near-memory accelerator, targeting radio astronomy and scientific applications. After exploring the design space of the architecture by focusing on minimizing the execution power of the processing pipeline of the SKA1-Low central signal processor, we show that our accelerator can achieve an energy efficiency of up to 390 GFLOPS/W, corresponding to an energy consumption one order of magnitude lower than alternative state-of-the-art implementations. When running additional mathematical and streaming-oriented kernels, our accelerator achieves from 6.4× to 20× energy efficiency improvement compared to alternative solutions.
