Packet Processing Architecture With Off-Chip LLC Using Interleaved 3D-Stacked DRAM

The performance of packet processing applications depends on the memory access speed of the network system. Table lookup, one of the most common operations in packet processing applications, requires fast memory access and can become a dominant performance bottleneck. In Network Function Virtualization (NFV) environments, the fast on-chip cache memories of general-purpose CPUs are therefore critical to achieving high-performance packet processing at rates above tens of Gbps. In addition, carrier network systems run multiple types of applications, including complex ones, simultaneously on the same system, so cache capacity matters as well as cache speed. In this paper, we propose a packet processing architecture that uses interleaved three-dimensional (3D)-stacked Dynamic Random Access Memory (DRAM) devices as an off-chip Last Level Cache (LLC), in addition to the several levels of dedicated cache memory in each CPU core. Entries of a lookup table are distributed across all banks and vaults to exploit both bank interleaving and vault-level memory access parallelism. Frequently accessed entries in the 3D-stacked DRAM are also cached in the dedicated on-chip cache memories of each CPU core. Evaluation results show that, compared with a conventional architecture that uses a shared on-chip LLC, the proposed architecture reduces memory access latency by 57 %, increases throughput by 100 %, and reduces blocking probability by about 10 %. These results indicate that 3D-stacked DRAM can be practical as an off-chip LLC for parallel packet processing running on multiple CPU cores simultaneously.
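The two mechanisms described above, spreading table entries over vaults and banks and caching hot entries per core, can be illustrated with a small sketch. It is not the paper's implementation: the vault and bank counts, the low-order interleaving scheme, the LRU replacement policy, and the names `locate` and `CoreCache` are all assumptions made for illustration.

```python
from collections import OrderedDict

# Assumed geometry; the abstract does not state vault or bank counts.
NUM_VAULTS = 32
BANKS_PER_VAULT = 16


def locate(entry_index, num_vaults=NUM_VAULTS, banks_per_vault=BANKS_PER_VAULT):
    """Map a table entry index to (vault, bank, row).

    Low-order interleaving (an assumed scheme): consecutive entries go to
    different vaults first, then to different banks within a vault.
    """
    vault = entry_index % num_vaults
    bank = (entry_index // num_vaults) % banks_per_vault
    row = entry_index // (num_vaults * banks_per_vault)
    return vault, bank, row


class CoreCache:
    """Small LRU cache standing in for one core's dedicated on-chip cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, entry_index, table):
        """Return the entry, going to the stacked-DRAM table only on a miss."""
        if entry_index in self.entries:
            self.entries.move_to_end(entry_index)  # mark as most recently used
            self.hits += 1
            return self.entries[entry_index]
        self.misses += 1
        value = table[locate(entry_index)]
        self.entries[entry_index] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return value


# Hypothetical lookup table keyed by physical location in the stacked DRAM.
table = {locate(i): f"next-hop-{i}" for i in range(1024)}
cache = CoreCache(capacity=2)
results = [cache.lookup(i, table) for i in (0, 0, 1, 2, 0)]
```

With this mapping, any 32 consecutive entries fall in 32 different vaults, so independent lookups can proceed in parallel, while repeated lookups of the same entry are served from the per-core cache.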
