Affinity-aware DMA buffer management for reducing off-chip memory access

It is well recognized that moving I/O data into and out of memory has become a critical cost for high-bandwidth devices. Embedded systems in particular, with their limited cache sizes and simple architectures, spend a large share of CPU cycles on off-chip memory access. The work presented in this paper addresses this problem through an Affinity-aware DMA Buffer management strategy, called ADB, which requires no change to the underlying hardware. We introduce the concept of buffer affinity, which describes where the data of a recently released DMA buffer resides in the memory hierarchy: the more of its data remains in cache, the higher the buffer's affinity. Exploiting the characteristics of the embedded system, we can identify buffer affinity at runtime. Using this online profiling, ADB allocates buffers of different affinity to different I/O paths. For output, ADB allocates a high-affinity buffer to reduce off-chip memory access when the OS copies data from the user buffer to the kernel buffer. For input, ADB allocates a low-affinity buffer to skip part of the cache-invalidation operations required to maintain I/O coherence. Measurements show that ADB, implemented in the Linux 2.6.32 kernel and running on a 1 GHz UniCore-2 processor, improves the performance of network-related programs by 5.2% to 8.8%.
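The allocation policy described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes affinity can be approximated by release recency (a buffer freed most recently is most likely to still be cache-resident), and all names (`adb_pool_t`, `adb_alloc_output`, etc.) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_SIZE 8

typedef struct {
    int id; /* buffer identifier, stands in for the real DMA buffer */
} dma_buf_t;

typedef struct {
    dma_buf_t *freed[POOL_SIZE]; /* ordered oldest -> newest release */
    int count;
} adb_pool_t;

/* On release, record the buffer; the newest release sits at the tail
 * and is treated as the highest-affinity buffer. */
void adb_release(adb_pool_t *p, dma_buf_t *b)
{
    assert(p->count < POOL_SIZE);
    p->freed[p->count++] = b;
}

/* Output path: take the most recently freed (highest-affinity) buffer,
 * so the user-to-kernel copy hits lines that are still in cache. */
dma_buf_t *adb_alloc_output(adb_pool_t *p)
{
    if (p->count == 0)
        return NULL;
    return p->freed[--p->count]; /* LIFO: newest first */
}

/* Input path: take the oldest (lowest-affinity) buffer, so fewer live
 * cache lines must be invalidated before the device DMAs into it. */
dma_buf_t *adb_alloc_input(adb_pool_t *p)
{
    if (p->count == 0)
        return NULL;
    dma_buf_t *b = p->freed[0]; /* FIFO: oldest first */
    memmove(&p->freed[0], &p->freed[1],
            (size_t)(p->count - 1) * sizeof(p->freed[0]));
    p->count--;
    return b;
}
```

In this sketch, directing output allocations to the LIFO end and input allocations to the FIFO end captures the paper's core idea: the two I/O directions want opposite ends of the affinity spectrum, so a single ordered free pool can serve both.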
