A survey of techniques for architecting TLBs

The translation lookaside buffer (TLB) caches virtual-to-physical address translation information and is used in systems ranging from embedded devices to high-end servers. Because the TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of the TLB is important for improving the performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects, and system engineers.
