APUNet: Revitalizing GPU as Packet Processing Accelerator

Many recent research works have experimented with GPUs to accelerate packet processing in network applications. Most have shown that a GPU brings a significant performance boost over the CPU-only approach, thanks to its highly parallel computation capacity and large memory bandwidth. However, a recent work argues that for many applications, the key enabler of high performance is not the GPU's parallel computation power but its inherent ability to automatically hide memory access latency. It further claims that a CPU can outperform or match a GPU if its code is rearranged to overlap computation with memory access, using optimization techniques such as group prefetching and software pipelining. In this paper, we revisit these claims and examine whether they generalize to a large class of network applications. Our findings with eight popular algorithms widely used in network applications show that (a) many compute-bound algorithms do benefit from the parallel computation capacity of the GPU while CPU-based optimizations fail to help, and (b) the relative performance advantage of the CPU over the GPU in most applications stems from the data transfer bottleneck of PCIe communication with a discrete GPU rather than from any lack of capacity in the GPU itself. To avoid the PCIe bottleneck, we suggest employing the integrated GPU in recent APU platforms as a cost-effective packet processing accelerator. We address a number of practical issues in fully exploiting the capacity of the APU and show that APU-based network applications achieve multi-10 Gbps performance for many compute/memory-intensive algorithms.
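To illustrate the CPU-side optimization the abstract refers to, below is a minimal sketch of group prefetching applied to a hypothetical hash-table lookup of the kind common in packet processing. The table layout, hash function, and group size are illustrative assumptions, not code from the paper: the point is the two-stage structure, where prefetches for a whole group of packets are issued before any bucket is read, so computation on one packet overlaps with the memory fetches of the others.

```c
/* Sketch of group prefetching (assumptions: power-of-two table size,
 * a GCC/Clang compiler for __builtin_prefetch, illustrative names). */
#include <stddef.h>
#include <stdint.h>

#define GROUP 8  /* group size; tuned per CPU in practice */

struct bucket { uint64_t key; uint64_t value; };

/* hypothetical hash; nbuckets must be a power of two */
static inline size_t hash(uint64_t key, size_t nbuckets) {
    return ((key * 0x9E3779B97F4A7C15ULL) >> 32) & (nbuckets - 1);
}

void lookup_group(const struct bucket *table, size_t nbuckets,
                  const uint64_t *keys, uint64_t *values, size_t n) {
    for (size_t base = 0; base < n; base += GROUP) {
        size_t cnt = (n - base < GROUP) ? n - base : GROUP;
        size_t idx[GROUP];

        /* stage 1: compute bucket indices and prefetch every bucket
         * in the group before touching any of them */
        for (size_t i = 0; i < cnt; i++) {
            idx[i] = hash(keys[base + i], nbuckets);
            __builtin_prefetch(&table[idx[i]], /*rw=*/0, /*locality=*/1);
        }

        /* stage 2: by the time each bucket is read, its cache line is
         * already in flight or resident, hiding the DRAM latency that
         * a one-packet-at-a-time loop would expose on every lookup */
        for (size_t i = 0; i < cnt; i++)
            values[base + i] = table[idx[i]].value;
    }
}
```

This is the same latency-hiding effect a GPU achieves automatically via massive multithreading, achieved manually on the CPU by restructuring the loop; software pipelining extends the idea by interleaving the stages of successive groups.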
