Paper Information - Near-Memory Acceleration for Radio Astronomy

Near-Memory Acceleration for Radio Astronomy

Processing-in-memory and near-memory computing have recently been rediscovered as a way to alleviate the “memory wall problem” of traditional computing architectures. In this paper, we discuss the implementation of a 3D-stacked near-memory accelerator, targeting radio astronomy and scientific applications. After exploring the design space of the architecture by focusing on minimizing the execution power of the processing pipeline of the SKA1-Low central signal processor, we show that our accelerator can achieve an energy efficiency of up to 390 GFLOPS/W, corresponding to an energy consumption one order of magnitude lower than alternative state-of-the-art implementations. When running additional mathematical and streaming-oriented kernels, our accelerator achieves from 6.4× to 20× energy efficiency improvement compared to alternative solutions.
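To put the 390 GFLOPS/W figure in more familiar terms, it can be converted into energy per floating-point operation. The snippet below is only a unit-conversion sketch; the function name `picojoules_per_flop` is a hypothetical helper introduced here, not something taken from the paper.

```python
def picojoules_per_flop(gflops_per_watt: float) -> float:
    """Convert an energy efficiency in GFLOPS/W to energy per operation in pJ/FLOP.

    1 GFLOPS/W equals 1e9 FLOP per joule, so the energy per FLOP in joules is
    1 / (gflops_per_watt * 1e9). Multiplying by 1e12 converts joules to picojoules.
    """
    if gflops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    flop_per_joule = gflops_per_watt * 1e9
    return 1e12 / flop_per_joule


# Peak efficiency quoted in the abstract (390 GFLOPS/W).
peak_pj = picojoules_per_flop(390.0)
print(f"{peak_pj:.2f} pJ/FLOP")
```

At the quoted peak efficiency this works out to roughly 2.56 pJ per floating-point operation.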
