An FPGA-based In-Line Accelerator for Memcached

We present a method for accelerating server applications using a hybrid CPU+FPGA architecture and demonstrate its advantages by accelerating Memcached, a distributed key-value system. The accelerator, implemented on the FPGA fabric, processes request packets directly from the network, avoiding the CPU in most cases. The accelerator is created by profiling the application to determine the most commonly executed trace of basic blocks which are then extracted. Traces are executed speculatively within the FPGA. If the control flow exits the trace prematurely, the side effects of the computation are rolled back and the request packet is passed to the CPU. When compared to the best reported software numbers, the Memcached accelerator is 9.15× more energy efficient for common case requests.

[1]  Gordon J. Brebner Single-chip gigabit mixed-version IP router on Virtex-II Pro , 2002, Proceedings. 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.

[2]  Mike O'Connor,et al.  Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems , 2012, 2012 IEEE International Symposium on Performance Analysis of Systems & Software.

[3]  Christoforos E. Kozyrakis,et al.  Understanding sources of inefficiency in general-purpose chips , 2010, ISCA.

[4]  Song Jiang,et al.  Workload analysis of a large-scale key-value store , 2012, SIGMETRICS '12.

[5]  Eitan Frachtenberg,et al.  Many-core key-value store , 2011, 2011 International Green Computing Conference and Workshops.

[6]  Yu Wang,et al.  FPMR: MapReduce framework on FPGA , 2010, FPGA '10.

[7]  Amin Ansari,et al.  Bundled execution of recurring traces for energy-efficient general purpose processing , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[8]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[9]  Karthikeyan Sankaralingam,et al.  Dark Silicon and the End of Multicore Scaling , 2012, IEEE Micro.

[10]  Muli Ben-Yehuda,et al.  Loosely Coupled TCP Acceleration Architecture , 2006, 14th IEEE Symposium on High-Performance Interconnects (HOTI'06).

[11]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[12]  H. Franke,et al.  Introduction to the wire-speed processor and architecture , 2010, IBM J. Res. Dev..

[13]  Nicholas Nethercote,et al.  Using Valgrind to Detect Undefined Value Errors with Bit-Precision , 2005, USENIX Annual Technical Conference, General Track.

[14]  Martin Margala,et al.  An FPGA memcached appliance , 2013, FPGA '13.

[15]  Derek Chiou,et al.  Compiling high throughput network processors , 2012, FPGA '12.

[16]  Andreas Koch,et al.  A Dynamically Reconfigured Network Platform for High-Speed Malware Collection , 2010, 2010 International Conference on Reconfigurable Computing and FPGAs.

[17]  Vikram Bhatt,et al.  The GreenDroid Mobile Application Processor: An Architecture for Silicon's Dark Future , 2011, IEEE Micro.