Paper information - Efficient synonym filtering and scalable delayed translation for hybrid virtual caching

Conventional translation look-aside buffers (TLBs) are required to complete address translation with short latencies, as the address translation is on the critical path of all memory accesses even for L1 cache hits. Such strict TLB latency restrictions limit the TLB capacity, as the latency increase with large TLBs may lower the overall performance even with potential TLB miss reductions. Furthermore, TLBs consume a significant amount of energy as they are accessed for every instruction fetch and data access. To avoid the latency restriction and reduce the energy consumption, virtual caching techniques have been proposed to defer translation to after L1 cache misses. However, an efficient solution for the synonym problem has been a critical issue hindering the wide adoption of virtual caching.

Based on the virtual caching concept, this study proposes a hybrid virtual memory architecture extending virtual caching to the entire cache hierarchy, aiming to improve both performance and energy consumption. The hybrid virtual caching uses virtual addresses augmented with address space identifiers (ASID) in the cache hierarchy for common non-synonym addresses. For such non-synonyms, the address translation occurs only after last-level cache (LLC) misses. For uncommon synonym addresses, the addresses are translated to physical addresses with conventional TLBs before L1 cache accesses. To support such hybrid translation, we propose an efficient synonym detection mechanism based on Bloom filters which can identify synonym candidates with few false positives. For large memory applications, delayed translation alone cannot solve the address translation problem, as fixed-granularity delayed TLBs may not scale with the increasing memory requirements. To mitigate the translation scalability problem, this study proposes a delayed many segment translation designed for the hybrid virtual caching. The experimental results show that our approach effectively lowers accesses to the TLBs, leading to significant power savings. In addition, the approach provides performance improvement with scalable delayed translation with variable length segments.
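
The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' hardware design. The class and function names (`SynonymFilter`, `cache_lookup_key`, `SegmentTable`), the 4 KiB page size, the filter size, the hash function, and the `tlb` callable are all assumptions made for illustration. The sketch shows a Bloom filter that flags synonym candidates per (ASID, virtual page), a lookup path that translates candidates up front and leaves other addresses virtual, and a variable-length segment table standing in for the delayed translation performed after an LLC miss.

```python
import hashlib
from bisect import bisect_right

PAGE_SHIFT = 12  # assumption: 4 KiB pages


class SynonymFilter:
    """Bloom filter over (ASID, virtual page number) pairs (hypothetical layout)."""

    def __init__(self, num_bits=1024, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, asid, vpn):
        # Derive num_hashes bit positions from independent salted hashes.
        for i in range(self.num_hashes):
            key = f"{i}:{asid}:{vpn}".encode()
            digest = hashlib.blake2b(key, digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, asid, vaddr):
        """Mark the page containing vaddr as a synonym candidate."""
        for pos in self._positions(asid, vaddr >> PAGE_SHIFT):
            self.bits[pos] = 1

    def may_be_synonym(self, asid, vaddr):
        """False means definitely not marked; True may be a false positive."""
        vpn = vaddr >> PAGE_SHIFT
        return all(self.bits[pos] for pos in self._positions(asid, vpn))


def cache_lookup_key(synonym_filter, tlb, asid, vaddr):
    """Pick the address used to access the cache hierarchy.

    `tlb` is a hypothetical callable (asid, vaddr) -> physical address.
    """
    if synonym_filter.may_be_synonym(asid, vaddr):
        # Synonym candidate: translate before the L1 access.
        return ("physical", tlb(asid, vaddr))
    # Common case: keep the ASID-tagged virtual address; translate after an LLC miss.
    return ("virtual", (asid, vaddr))


class SegmentTable:
    """Variable-length segments mapping virtual ranges to physical ranges."""

    def __init__(self):
        self._bases = {}  # asid -> sorted list of virtual base addresses
        self._segs = {}   # asid -> parallel list of (vbase, length, pbase)

    def add(self, asid, vbase, length, pbase):
        bases = self._bases.setdefault(asid, [])
        segs = self._segs.setdefault(asid, [])
        i = bisect_right(bases, vbase)
        bases.insert(i, vbase)
        segs.insert(i, (vbase, length, pbase))

    def translate(self, asid, vaddr):
        """Return the physical address, or None if no segment covers vaddr."""
        bases = self._bases.get(asid, [])
        i = bisect_right(bases, vaddr) - 1
        if i >= 0:
            vbase, length, pbase = self._segs[asid][i]
            if vaddr < vbase + length:
                return pbase + (vaddr - vbase)
        return None
```

In this sketch a Bloom filter never misses a page that was added, so a real synonym always takes the physical path, while a false positive costs only an unnecessary up-front translation. One segment entry covers a range of any length, which is the property the abstract relies on for scaling with large memory footprints.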

Jaehyuk Huh | Chang Hyun Park | Taekyung Heo
