IOMMU: strategies for mitigating the IOTLB bottleneck

The input/output memory management unit (IOMMU) was recently introduced into mainstream computer architecture when both Intel and AMD added IOMMUs to their chipsets. An IOMMU protects memory from I/O devices by letting system software control which areas of physical memory each I/O device may access. This protection adds overhead to direct memory access (DMA), because every DMA address must be resolved and validated. IOMMUs include an input/output translation lookaside buffer (IOTLB) to speed up address resolution, but every IOTLB miss still substantially increases DMA latency and degrades the performance of DMA-intensive workloads. In this paper we first demonstrate the potential negative impact of IOTLB misses on workload performance. We then propose both system software and hardware enhancements that reduce the IOTLB miss rate and accelerate address resolution. These enhancements can reduce the IOTLB miss rate by over 60% for common I/O-intensive workloads.
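
To make the notion of IOTLB miss rate concrete, the following is a minimal sketch of a toy IOTLB, assuming a fully associative cache with LRU replacement and a device that cycles through a ring of DMA buffers. The class `ToyIOTLB`, the helpers `ring_buffer_trace` and `simulate`, the capacity of 32 entries, and the page sizes are all hypothetical choices made for illustration; they do not describe real hardware or the enhancements evaluated in the paper. Mapping with a larger page size is used here only as a generic way to show how the miss rate responds to a change in mapping policy.

```python
from collections import OrderedDict


class ToyIOTLB:
    """Toy fully associative IOTLB with LRU replacement (illustrative only)."""

    def __init__(self, capacity, page_shift=12):
        self.capacity = capacity          # number of cached translations
        self.page_shift = page_shift      # log2 of the mapping page size
        self.entries = OrderedDict()      # I/O page number -> present, in LRU order
        self.hits = 0
        self.misses = 0

    def access(self, io_addr):
        """Look up one DMA address; return True on a hit, False on a miss."""
        page = io_addr >> self.page_shift
        if page in self.entries:
            self.entries.move_to_end(page)    # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1                      # a real IOMMU would walk page tables here
        self.entries[page] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return False

    def miss_rate(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


def ring_buffer_trace(num_buffers, buffer_size, rounds):
    """Yield the start address of each buffer in a ring, for several rounds."""
    for _ in range(rounds):
        for i in range(num_buffers):
            yield i * buffer_size


def simulate(capacity, page_shift, trace):
    """Run a trace of DMA addresses through a fresh toy IOTLB."""
    tlb = ToyIOTLB(capacity, page_shift)
    for addr in trace:
        tlb.access(addr)
    return tlb.miss_rate()


# 64 buffers of 4 KiB each, visited in order for 10 rounds.
small_pages = simulate(32, 12, ring_buffer_trace(64, 4096, 10))  # 4 KiB mappings
large_pages = simulate(32, 21, ring_buffer_trace(64, 4096, 10))  # 2 MiB mappings

print(f"miss rate with 4 KiB mappings: {small_pages:.4f}")
print(f"miss rate with 2 MiB mappings: {large_pages:.4f}")
```

In this toy setup the ring holds more pages than the cache has entries, so cycling through it under LRU evicts each translation before it is reused; covering the whole ring with a single large mapping removes almost all of those misses. Real IOTLB organizations and replacement policies may differ.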
