Re-architecting VMMs for Multicore Systems: The Sidecore Approach

Future many-core platforms present scalability challenges to VMMs, including the need to efficiently utilize their processor and cache resources. Focusing on platform virtualization, we address these challenges by devising new virtualization methods that not only work with, but actually exploit, the many-core nature of future processors. Specifically, we exploit the fact that cores will differ with respect to their current internal processor and memory state. The hypervisor, or VMM, then leverages these differences to substantially improve VMM performance and better utilize these cores. The key idea underlying this work is simple: to carry out some privileged VMM operation, rather than forcing a core to undergo an expensive internal state change via a trap, such as a VMexit in Intel's VT architecture, why not have the operation carried out by a remote core that is already in the appropriate state? Termed the sidecore approach to running VMM-level functionality, it can be used to run VMM services more efficiently on remote cores that are already in VMM state. This paper demonstrates the viability and utility of the sidecore approach for two classes of VMM-level functionality: (1) efficient VM-VMM communication in VT-enabled processors and (2) interrupt virtualization for self-virtualized devices.

I. SIDECORES: STRUCTURING HYPERVISORS FOR MANY-CORE PLATFORMS

Virtualization technologies are becoming increasingly important for fully utilizing future many-core systems. Evidence of this trend includes Virtual Machine Monitors (VMMs) like Xen [1] and VMware [2], which support the creation and execution of multiple virtual machines (VMs) on a single platform in secure and isolated environments and manage the physical resources of the host machine [3]. Further evidence comes from recent architectural advances, such as hardware support for virtualization (e.g., Intel's VT [4] and AMD's Pacifica [5] technologies) and I/O virtualization support from upcoming PCI devices [6].

Unfortunately, current VMM designs are monolithic: all cores on a virtualized multi-core platform execute the same set of VMM functionality. This paper advocates an alternative design choice, which is to structure a VMM as multiple components, each responsible for certain VMM functionality and internally structured to best meet its obligations [7]. As a result, in multi- and many-core systems, these components can even execute on cores other than those on which their functions are called. Furthermore, it becomes possible to 'specialize' cores, permitting them to efficiently execute certain subsets of, rather than complete sets of, VMM functionality. There are multiple reasons why functionally specialized, componentized VMMs are superior to current monolithic VMM implementations, particularly for future many-core platforms:

1) Since only specific VMM code pieces run on particular cores, performance for these code pieces may improve from reductions in cache misses, including in the trace cache, D-cache, and TLB, due to reduced sharing of these resources with other VMM code. Further, assuming the VMM and guest VMs do not share much data, VMM code and data are less likely to pollute a guest VM's cache state, thereby improving overall guest performance.

2) By using a single core or a small set of cores for certain VMM functionality (e.g., page table management), locking requirements may be reduced for shared data structures, such as guest VM page tables. This can positively impact the scalability of SMP guest VMs.
3) When a core executes a VMM function, it is already in the appropriate processor state for running another such function, thus reducing or removing the need for expensive processor state changes (e.g., the VMexit trap in Intel's VT architecture). Some of the performance measurements presented in this paper leverage this fact (see Section II).

4) In heterogeneous multicore systems, some cores may be specialized and hence can offer improved performance for certain tasks compared to other, non-specialized cores [8].

5) Dedicating a core can provide better performance and scalability for the I/O virtualization path, as demonstrated in Section III.

6) To take full advantage of many computational cores, future architectures will likely offer fast core-to-core communication infrastructures [9], rather than relying on relatively slow memory-based communication. The sidecore approach can leverage those technology developments. Initial evidence includes high-performance inter-core interconnects, such as AMD's HyperTransport [10] and Intel's planned CSI.

In this paper, we propose sidecores as a means for structuring future VMMs in many-core systems. The current implementation dedicates a single core, termed the sidecore, to perform specific VMM functions. This sidecore differs from normal cores in that it executes only one or a small set of VMM functions, whereas normal cores execute generic guest VM and VMM code. A service request to any such sidecore is termed a sidecall, and such calls can be made from a guest VM or from a platform component, such as an I/O device. The result is a VMM that attains improved performance by internally using the client-server paradigm, in which the VMM (server), executing on a different core, performs a service requested by VMs or peripherals (clients).

We demonstrate the viability and advantages of the sidecore approach in two ways. First, a sidecore is used to perform efficient routing of service requests from the guest VM to the VMM, to avoid costly VMexits in VT-enabled processors. Second, the sidecore approach is used to enhance the I/O virtualization capabilities of self-virtualized devices via efficient interrupt virtualization. We conclude the paper with related work and future directions.

II. EFFICIENT GUEST VM-VMM COMMUNICATION IN VT-ENABLED PROCESSORS

Earlier implementations of the x86 architecture were not conducive to classical trap-and-emulate virtualization [11] due to the behavior of certain instructions. System virtualization techniques for the x86 architecture therefore relied on either non-intrusive but costly binary rewriting [2] or efficient but highly intrusive paravirtualization [1]. These issues are addressed by the architecture enhancements added by Intel [4] and AMD [5]. In Intel's case, the basic mechanisms for virtualization in VT-enabled processors are VMentry and VMexit. When the guest VM performs a privileged operation it is not permitted to execute, or when the guest VM explicitly requests a service from the VMM, it generates a VMexit and control is transferred to the VMM. The VMM performs the requested operation on the guest's behalf and returns to the guest VM using VMentry. Hence, the cost of VMentry and VMexit is an important factor in the performance of implementation methods for system virtualization. The microbenchmark results presented in Figure 1 compare the cost of VMentry and VMexit with the inter-core communication latency experienced by the sidecore approach.
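To make the trap-based path concrete, the following sketch shows how a guest might explicitly request VMM service on a VT-enabled processor: executing VMCALL forces a VMexit into the VMM, which services the request and resumes the guest via VMentry. This corresponds to the 'Null' call measured in Figure 1. The register convention for passing a request code is an illustrative assumption, not Xen's actual hypercall ABI.

#include <stdint.h>

/*
 * Minimal sketch of the trap-based guest-to-VMM path (the "Null" call in
 * Figure 1): the guest executes VMCALL, the processor performs a VMexit
 * into the VMM, which handles the request and resumes the guest with
 * VMentry.  This must run inside a VT-enabled guest; passing the request
 * code in RAX is an assumed, illustrative convention.
 */
static inline uint64_t null_vmcall(uint64_t req)
{
    uint64_t ret;
    __asm__ __volatile__("vmcall"          /* forces a VMexit             */
                         : "=a"(ret)       /* result returned in RAX      */
                         : "a"(req)        /* request code passed in RAX  */
                         : "memory");
    return ret;
}

Every such call pays the full cost of a VMexit/VMentry round trip on the calling core; the sidecore variant instead forwards the request through memory shared with a core that is already in VMM state.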
The results in Figure 1 were gathered on a 3.0 GHz dual-core, x86-64, VT-enabled system running a uni-processor VT-enabled guest VM (hereafter referred to as the hvm domain). The hvm domain runs an unmodified Linux 2.6.16.13 kernel and is allocated 256 MB of RAM. The latest unstable version of Xen 3.0 is used as the VMM. The figure shows the VMexit latency for three cases in which the hvm domain needs to communicate with the VMM: (1) making a 'Null' call, where the VMCALL instruction is used to cause a VMexit and the VMM immediately returns; (2) obtaining the result of the CPUID instruction, which causes a VMexit, after which the VMM executes the real CPUID instruction on the hvm domain's behalf and returns the result; and (3) performing page table updates, which may result in a VMexit and corresponding shadow page table management by the VMM. The figure also presents comparative results when VM-VMM communication is implemented as a sidecall using shared memory (shm), as depicted in Figure 2. In particular, one core is assigned as the sidecore, and the other core runs the hvm domain with a slightly modified Linux kernel.

Fig. 1. VMexit versus shared-memory sidecall latency for the Null Call, CPUID, and PTE Update operations.
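The shared-memory sidecall path of Figure 2 can be sketched as follows, under assumed data structures: the guest and the sidecore share a single request slot, the guest posts a request and spins on a completion flag instead of executing VMCALL, and the sidecore, already in VMM state, polls the slot and services the request. The field names, request codes, and single-slot polling protocol are illustrative assumptions, not the actual Xen implementation.

#include <stdint.h>

/* Assumed request codes for the two sidecalls discussed above. */
enum sidecall_op { SIDECALL_NONE = 0, SIDECALL_CPUID, SIDECALL_PTE_UPDATE };

/* One request slot in a page shared between the hvm domain and the sidecore. */
struct sidecall_slot {
    volatile uint32_t op;        /* request code; SIDECALL_NONE when idle   */
    volatile uint32_t done;      /* set by the sidecore when results are in */
    volatile uint64_t arg[4];    /* request arguments                       */
    volatile uint64_t ret[4];    /* results written by the sidecore         */
};

/* Guest side: post a request and spin until the sidecore completes it.
 * No VMexit occurs; the exchange is plain cache-coherent shared memory. */
static void sidecall(struct sidecall_slot *slot, uint32_t op,
                     const uint64_t arg[4], uint64_t ret[4])
{
    for (int i = 0; i < 4; i++)
        slot->arg[i] = arg[i];
    slot->done = 0;
    __sync_synchronize();                  /* publish arguments before op  */
    slot->op = op;
    while (!slot->done)
        __asm__ __volatile__("pause");     /* spin politely on the flag    */
    for (int i = 0; i < 4; i++)
        ret[i] = slot->ret[i];
}

/* Sidecore side: poll the slot and service requests while remaining in
 * VMM state, so no processor state change is needed per request. */
static void sidecore_loop(struct sidecall_slot *slot)
{
    for (;;) {
        while (slot->op == SIDECALL_NONE)
            __asm__ __volatile__("pause");
        switch (slot->op) {
        case SIDECALL_CPUID:
            /* ...execute CPUID on the guest's behalf, fill slot->ret... */
            break;
        case SIDECALL_PTE_UPDATE:
            /* ...apply the update to the guest's shadow page tables...  */
            break;
        }
        slot->op = SIDECALL_NONE;          /* ready for the next request  */
        __sync_synchronize();              /* publish results before done */
        slot->done = 1;
    }
}

Because neither side traps, the cost of a sidecall is dominated by inter-core cache-coherent communication rather than by a VMexit/VMentry state change on the calling core.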

REFERENCES

[1] Gurindar S. Sohi et al., "The use of multithreading for exception handling," in Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-32), 1999.
[2] "Direct addressed caches for reduced power consumption," in MICRO, 2001.
[3] Liviu Iftode et al., "TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance," 2002.
[4] Frank T. Hady et al., "Platform level support for high throughput edge applications: the Twin Cities prototype," 2003.
[5] Tim Harris et al., "Xen and the art of virtualization," 2003.
[6] Vikram A. Saletore et al., "ETA: experience with an Intel Xeon processor as a packet processing engine," IEEE Micro, 2004.
[7] Ada Gavrilovska et al., "C-CORE: Using Communication Cores for High Performance Network Services," in Fourth IEEE International Symposium on Network Computing and Applications, 2005.
[8] Amin Vahdat et al., "Enforcing Performance Isolation Across Virtual Machines in Xen," in Middleware, 2006.
[9] Koushik Chakraborty et al., "Computation spreading: employing hardware migration to specialize CMP cores on-the-fly," in ASPLOS XII, 2006.
[10] Dhabaleswar K. Panda et al., "High Performance VMM-Bypass I/O in Virtual Machines," in USENIX Annual Technical Conference, General Track, 2006.
[11] Ole Agesen et al., "A comparison of software and hardware techniques for x86 virtualization," in ASPLOS XII, 2006.
[12] Vikram A. Saletore et al., "Evaluating network processing efficiency with processor partitioning and asynchronous I/O," in EuroSys, 2006.
[13] Dilma Da Silva et al., "K42: building a complete operating system," in EuroSys, 2006.
[14] Bratin Saha et al., "Enabling scalability and performance in a large scale CMP environment," in EuroSys '07, 2007.
[15] Karsten Schwan et al., "High performance and scalable I/O virtualization via self-virtualized devices," in HPDC '07, 2007.