Cache restoration for highly partitioned virtualized systems

The economics of server consolidation have led to the support of virtualization features in almost all server-class systems, and the virtualization feature set has become an area of significant competition. While most systems allow partitioning only at the relatively coarse grain of a single core, some systems also support multiprogrammed virtualization, whereby a system can be more finely partitioned through time-sharing, down to a percentage of a core being allotted to a virtual machine. When multiple virtual machines share a single core, however, performance can suffer due to the displacement of microarchitectural state. We introduce cache restoration, a hardware-based prefetching mechanism initiated by the underlying virtualization software when a virtual machine is scheduled onto a core, which prefetches the virtual machine's working set and warms its initial environment. Through cycle-accurate simulation of a POWER7 system, we show that, when applied to the POWER7's private per-core L3 last-level cache, the warm cache translates into an average performance improvement of 20% for a mixture of workloads on a highly partitioned core, compared to a virtualized server without cache restoration.
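As a rough illustration of the idea only, the sketch below models the two hypervisor hooks cache restoration would need: on deschedule, snapshot the tags of the lines the departing virtual machine holds in its private L3; on reschedule, replay those tags as prefetches so the cache is warm before the guest resumes. This is a minimal sketch, not the paper's actual hardware/hypervisor interface; all names (vm_footprint, snapshot_on_deschedule, restore_on_schedule) and parameters are hypothetical, and the prefetch itself is stubbed out with a print.

```c
/* Hypothetical sketch of cache restoration.  In the real design the
 * prefetching is done by hardware and triggered by the hypervisor;
 * here it is modelled entirely in host C code for illustration. */
#include <stdio.h>

#define L3_LINES   4096     /* lines tracked per virtual machine (assumed) */
#define LINE_BYTES 128      /* POWER7 cache-line size                      */

/* Per-VM record of the line tags resident when the VM was descheduled. */
struct vm_footprint {
    unsigned long tags[L3_LINES];
    int           count;
};

/* Deschedule hook: walk the private L3 directory (modelled as an array of
 * resident line addresses) and save the tags belonging to this VM. */
static void snapshot_on_deschedule(struct vm_footprint *fp,
                                   const unsigned long *resident,
                                   int nresident)
{
    fp->count = nresident < L3_LINES ? nresident : L3_LINES;
    for (int i = 0; i < fp->count; i++)
        fp->tags[i] = resident[i];
}

/* Schedule hook: replay the saved tags as prefetches so the L3 is warm
 * before the guest starts executing.  A real implementation would hand
 * the tag list to a hardware prefetch engine. */
static void restore_on_schedule(const struct vm_footprint *fp)
{
    for (int i = 0; i < fp->count; i++) {
        unsigned long line_addr = fp->tags[i] * LINE_BYTES;
        printf("prefetch line 0x%lx\n", line_addr);  /* stand-in for a prefetch */
    }
}

int main(void)
{
    /* Pretend these lines were resident when the VM was switched out. */
    unsigned long resident[] = { 0x1000, 0x1001, 0x1002, 0x2200 };
    struct vm_footprint fp = { .count = 0 };

    snapshot_on_deschedule(&fp, resident, 4);
    restore_on_schedule(&fp);   /* warms the cache before the VM resumes */
    return 0;
}
```

The key design point the sketch tries to capture is that the footprint is recorded at switch-out and replayed at switch-in, so the prefetch traffic overlaps the context-switch latency rather than the guest's own execution.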
