论文信息 - Translation-Triggered Prefetching

Translation-Triggered Prefetching

We propose translation-enabled memory prefetching optimizations or TEMPO, a low-overhead hardware mechanism to boost memory performance by exploiting the operating system's (OS) virtual memory subsystem. We are the first to make the following observations: (1) a substantial fraction (20-40%) of DRAM references in modern big- data workloads are devoted to accessing page tables; and (2) when memory references require page table lookups in DRAM, the vast majority of them (98%+) also look up DRAM for the subsequent data access. TEMPO exploits these observations to enable DRAM row-buffer and on-chip cache prefetching of the data that page tables point to. TEMPO requires trivial changes to the memory controller (under 3% additional area), no OS or application changes, and improves performance by 10-30% and energy by 1-14%.

Abhishek Bhattacharjee | A. Bhattacharjee

[1] M. Frans Kaashoek,et al. RadixVM: scalable address spaces for multithreaded applications , 2013, EuroSys '13.

[2] Norman P. Jouppi,et al. Rethinking DRAM design and organization for energy-constrained multi-cores , 2010, ISCA.

[3] Osman S. Unsal,et al. Energy-efficient address translation , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[4] Onur Mutlu,et al. The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost , 2014, 2014 IEEE 32nd International Conference on Computer Design (ICCD).

[5] Onur Mutlu,et al. Research Problems and Opportunities in Memory Systems , 2014, Supercomput. Front. Innov..

[6] Jayneel Gandhi,et al. Efficient Memory Virtualization , 2016 .

[7] O Seongil,et al. Reducing memory access latency with asymmetric DRAM bank organizations , 2013, ISCA.

[8] Onur Mutlu,et al. Gather-Scatter DRAM: In-DRAM address translation to improve the spatial locality of non-unit strided accesses , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[9] William J. Dally,et al. Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[10] Ján Veselý,et al. Large pages and lightweight memory management in virtualized environments: Can you have it both ways? , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[11] Srinivas Devadas,et al. IMP: Indirect memory prefetcher , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[12] Kang G. Shin,et al. Design and Implementation of Power-Aware Virtual Memory , 2003, USENIX ATC, General Track.

[13] Rachata Ausavarungnirun,et al. RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[14] James E. Smith,et al. Fair Queuing Memory Systems , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[15] Abhishek Bhattacharjee,et al. Large-reach memory management unit caches , 2013, MICRO.

[16] Bruce Jacob,et al. The Memory System: You Can't Avoid It, You Can't Ignore It, You Can't Fake It , 2009, The Memory System: You Can't Avoid It, You Can't Ignore It, You Can't Fake It.

[17] Seth H. Pugsley,et al. Efficiently prefetching complex address patterns , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[18] Dirk Grunwald,et al. A stateless, content-directed data prefetching mechanism , 2002, ASPLOS X.

[19] Mor Harchol-Balter,et al. ATLAS : A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers , 2010 .

[20] Onur Mutlu,et al. Tiered-latency DRAM: A low latency and low cost DRAM architecture , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[21] David W. Nellans,et al. Micro-pages: increasing DRAM efficiency with locality-aware data placement , 2010, ASPLOS XV.

[22] Margaret Martonosi,et al. Shared last-level TLBs for chip multiprocessors , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[23] R. Govindarajan,et al. Multiple sub-row buffers in DRAM: unlocking performance and energy improvement opportunities , 2012, ICS '12.

[24] Margaret Martonosi,et al. TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs , 2013, TACO.

[25] Onur Mutlu,et al. BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling , 2016, IEEE Transactions on Parallel and Distributed Systems.

[26] Michael M. Swift,et al. Reducing memory reference energy with opportunistic virtual caching , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[27] Onur Mutlu,et al. Memory scaling: A systems architecture perspective , 2013, 2013 5th IEEE International Memory Workshop.

[28] Rachata Ausavarungnirun,et al. Row buffer locality aware caching policies for hybrid memories , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[29] M. Frans Kaashoek,et al. Scalable address spaces using RCU balanced trees , 2012, ASPLOS XVII.

[30] Abhishek Bhattacharjee,et al. Efficient Address Translation for Architectures with Multiple Page Sizes , 2017, ASPLOS.

[31] Onur Mutlu,et al. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems , 2008, 2008 International Symposium on Computer Architecture.

[32] Margaret Martonosi,et al. COATCheck: Verifying Memory Ordering at the Hardware-OS Interface , 2016, ASPLOS.

[33] Uri C. Weiser,et al. Loop-Aware Memory Prefetching Using Code Block Working Sets , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[34] Gregory F. Snyder,et al. The illustris simulation: Public data release , 2015, Astron. Comput..

[35] Alan L. Cox,et al. Practical, transparent operating system support for superpages , 2002, OPSR.

[36] Osman S. Unsal,et al. Redundant Memory Mappings for fast access to large memories , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[37] Chia-Lin Yang,et al. Improving DRAM latency with dynamic asymmetric subarray , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[38] Natalie D. Enright Jerger,et al. Achieving predictable performance through better memory controller placement in many-core CMPs , 2009, ISCA '09.

[39] Onur Mutlu,et al. ChargeCache: Reducing DRAM latency by exploiting row access locality , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[40] Onur Mutlu,et al. Low-Cost Inter-Linked Subarrays (LISA): Enabling fast inter-subarray data movement in DRAM , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[41] Uri C. Weiser,et al. Semantic locality and context-based prefetching using reinforcement learning , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[42] Alan L. Cox,et al. Translation caching: skip, don't walk (the page table) , 2010, ISCA.

[43] Onur Mutlu,et al. A case for exploiting subarray-level parallelism (SALP) in DRAM , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[44] Thomas F. Wenisch,et al. CoScale: Coordinating CPU and Memory System DVFS in Server Systems , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[45] David W. Nellans,et al. Prediction Based DRAM Row-Buffer Management in the Many-Core Era , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[46] Alan L. Cox,et al. SpecTLB: A mechanism for speculative address translation , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[47] Ján Veselý,et al. Observations and opportunities in architecting shared virtual memory for heterogeneous systems , 2016, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[48] Gabriel H. Loh,et al. Increasing TLB reach by exploiting clustering in page translations , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[49] Srilatha Manne,et al. Accelerating two-dimensional page walks for virtualized systems , 2008, ASPLOS.

[50] Onur Mutlu,et al. Improving DRAM performance by parallelizing refreshes with accesses , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[51] Margaret Martonosi,et al. Inter-Core Cooperative TLB Prefetchers for Chip Multiprocessors , 2010, ASPLOS 2010.

[52] Aamer Jaleel,et al. CoLT: Coalesced Large-Reach TLBs , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[53] Xin Tong,et al. Prediction-based superpage-friendly TLB designs , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[54] Michael M. Swift,et al. Efficient virtual memory for big memory servers , 2013, ISCA.

[55] Zhimin Zhang,et al. RBPP: A row based DRAM page policy for the many-core era , 2014, 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).

[56] Qingyuan Deng,et al. MemScale: active low-power modes for main memory , 2011, ASPLOS XVI.