Use of superpages and subblocking in the address translation hierarchy

Most computers that support virtual memory translate virtual addresses to physical addresses using a translation lookaside buffer (TLB) and a page table. Time spent in TLB miss handling (the number of TLB misses times the average TLB miss penalty) is increasing due to workload, architectural, and technological trends. This thesis studies TLB architectures that reduce the number of TLB misses by increasing TLB reach (the maximum address space mapped by a TLB), and page table designs that decrease TLB miss penalty or support new TLB architectures without increasing TLB miss penalty.
First, this thesis evaluates two TLB architectures in commercial use: superpages and complete subblocking. It studies the benefits of superpages and the issues involved in modifying operating systems and page tables to support them. Complete subblocking allows processor designers to use larger chip areas to build large TLBs within cycle time constraints. Simulation results show that, for comparable chip area, complete-subblock TLBs have faster access times and incur fewer TLB misses than single-page-size TLBs, without requiring operating system changes.
Second, this thesis proposes a new TLB architecture, partial subblocking, that combines the best features of complete subblocking and superpages. Simulation results show that superpage and subblock TLBs, for comparable chip area, incur fewer TLB misses than single-page-size TLBs. Further, partial-subblock TLBs require simpler operating system support and incur fewer misses than superpage TLBs.
Third, superpage and partial-subblock TLBs are ineffective without operating system support. This thesis identifies the policies and mechanisms required to support these TLBs. In particular, it proposes a physical memory allocation algorithm, page reservation, that makes partial-subblock TLBs effective or, for superpage TLBs, eliminates page copying during superpage creation.
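
The two quantities defined at the start of the abstract, TLB reach and time spent in miss handling, can be made concrete with a little arithmetic. The sketch below is illustrative only: the entry count, page sizes, subblock factor, miss count, and miss penalty are assumed values chosen for the example, not figures from the thesis.

```python
def tlb_reach(entries: int, bytes_per_entry: int) -> int:
    """TLB reach: the maximum address space a TLB can map at once."""
    return entries * bytes_per_entry


def miss_handling_time(misses: int, avg_penalty_cycles: float) -> float:
    """Time in TLB miss handling = number of misses times average penalty."""
    return misses * avg_penalty_cycles


KB = 1024
BASE_PAGE = 4 * KB       # assumed base page size
ENTRIES = 64             # assumed number of TLB entries
SUBBLOCK_FACTOR = 16     # assumed base pages covered by one subblock entry
SUPERPAGE = 64 * KB      # assumed superpage size

# Single-page-size TLB: each entry maps one base page.
single = tlb_reach(ENTRIES, BASE_PAGE)
# Subblock TLB: each entry (one tag) can map up to SUBBLOCK_FACTOR base pages.
subblock = tlb_reach(ENTRIES, SUBBLOCK_FACTOR * BASE_PAGE)
# Superpage TLB, best case: every entry holds one superpage.
superpage = tlb_reach(ENTRIES, SUPERPAGE)

print(single // KB, subblock // KB, superpage // KB)  # reach in KB

# Assumed workload: one million misses at 40 cycles each.
print(miss_handling_time(1_000_000, 40.0))
```

Under these assumed parameters the subblock and superpage configurations map sixteen times as much address space as the single-page-size TLB with the same number of entries. This is the sense in which a larger reach can reduce the miss count.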
Fourth, this thesis suggests modifications to conventional page tables to support superpage and subblock TLBs and proposes a new page table structure, the clustered page table, that augments hashed page tables with subblocking. Simulation results show that clustered page tables are smaller and have faster access times than conventional page tables when using single-page-size TLBs. These advantages grow when a clustered page table stores superpage and subblock page table entries (PTEs).
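
To make the idea of adding subblocking to a hashed page table more concrete, here is a minimal sketch. It assumes that each hash node carries one tag for an aligned block of consecutive virtual pages plus a small array of PTEs, so that the tag and chain overhead is shared across the block. The class name, the subblock factor of 16, the bucket count, and the use of plain integers as PTEs are all hypothetical choices for illustration; the structure described in the thesis may differ in its details.

```python
from typing import Dict, List, Optional, Tuple

SUBBLOCK_FACTOR = 16  # assumed: base pages per aligned page block
NUM_BUCKETS = 1024    # assumed hash table size


class ClusteredPageTable:
    """Hypothetical hashed page table whose nodes hold a block of PTEs."""

    def __init__(self) -> None:
        # Each bucket maps a page-block number (the tag) to its PTE array.
        # A dict per bucket stands in for the hash chain of a real table.
        self.buckets: List[Dict[int, List[Optional[int]]]] = [
            {} for _ in range(NUM_BUCKETS)
        ]

    @staticmethod
    def _split(vpn: int) -> Tuple[int, int]:
        # Virtual page number -> (page-block number, index within the block).
        return vpn // SUBBLOCK_FACTOR, vpn % SUBBLOCK_FACTOR

    def insert(self, vpn: int, ppn: int) -> None:
        block, index = self._split(vpn)
        bucket = self.buckets[block % NUM_BUCKETS]
        # Create the node on first use; later pages in the block reuse it.
        ptes = bucket.setdefault(block, [None] * SUBBLOCK_FACTOR)
        ptes[index] = ppn

    def lookup(self, vpn: int) -> Optional[int]:
        block, index = self._split(vpn)
        ptes = self.buckets[block % NUM_BUCKETS].get(block)
        return None if ptes is None else ptes[index]

    def node_count(self) -> int:
        return sum(len(bucket) for bucket in self.buckets)


table = ClusteredPageTable()
for vpn in range(32, 48):  # 16 consecutive pages in one aligned block
    table.insert(vpn, ppn=1000 + vpn)

print(table.lookup(40), table.lookup(48), table.node_count())
```

With these assumed parameters the 16 mappings share a single node, whereas a hashed page table holding one PTE per node would need 16 nodes, each with its own tag and chain pointer. That sharing is one plausible reading of why the abstract reports smaller tables.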
