A Hybrid Architecture With Low Latency Interfaces Enabling Dynamic Cache Management

The dominant technologies in the high-performance computing (HPC) market, such as GPUs and multicore systems, focus mainly on processing power, while much less attention has been paid to communication delays inside hybrid architectures. To fill this gap, this paper presents an experimental study of Intel's Broadwell Xeon multicore processor with an integrated Arria 10 FPGA. The study characterizes the communication delays between the CPUs and the FPGA over both the low-latency cache-coherent interface and the two PCIe links offered by this platform. The results show that the FPGA cache access latency can be as low as 25 cycles at 400 MHz (62.5 ns) and that the platform can reach a bandwidth of over 20 GB/s when the three available links are aggregated. Furthermore, an FPGA-based cache management mechanism is proposed and implemented. This scheme takes advantage of QPI cache coherency and queuing theory to achieve low-latency, efficient memory management. A case study on a Merkle tree hash function shows that a hardware accelerator can achieve a fivefold data access acceleration in the worst-case scenario. In addition, design recommendations are given for using the CPU-FPGA platform to implement fine-grained memory management schemes.
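To make the case-study workload concrete, the following is a minimal software sketch of a Merkle tree root computation. It is not the accelerator described in this paper: the use of SHA-256, the binary tree shape, the rule of duplicating the last node on odd-sized levels, and the names `sha256` and `merkle_root` are all assumptions chosen for illustration. The sketch only shows that each tree level reads pairs of child hashes to produce one parent hash, which is the kind of data access pattern a cache management scheme would have to serve.

```python
import hashlib
from typing import Sequence


def sha256(data: bytes) -> bytes:
    """Return the SHA-256 digest of data (hash choice is an assumption)."""
    return hashlib.sha256(data).digest()


def merkle_root(leaves: Sequence[bytes]) -> bytes:
    """Compute the root of a binary Merkle tree over the given leaves.

    Assumed convention: when a level has an odd number of nodes, the
    last node is duplicated so that every parent has two children.
    """
    if not leaves:
        raise ValueError("at least one leaf is required")

    # Level 0: hash every leaf once.
    level = [sha256(leaf) for leaf in leaves]

    # Repeatedly combine adjacent pairs until a single root remains.
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            sha256(level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

A call such as `merkle_root([b"tx0", b"tx1", b"tx2"])` returns a 32-byte digest; changing any leaf changes the root.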
