High-Bandwidth Address Translation for Multiple-Issue Processors

In an effort to push the envelope of system performance, microprocessor designs are continually exploiting higher levels of instruction-level parallelism, resulting in increasing bandwidth demands on the address translation mechanism. Most current microprocessor designs meet this demand with a multi-ported TLB. While this design provides an excellent hit rate at each port, its access latency and area grow very quickly as the number of ports is increased. As bandwidth demands continue to increase, multi-ported designs will soon impact memory access latency.

We present four high-bandwidth address translation mechanisms with latency and area characteristics that scale better than a multi-ported TLB design. We extend traditional high-bandwidth memory design techniques to address translation, developing interleaved and multi-level TLB designs. In addition, we introduce two new designs crafted specifically for high-bandwidth address translation. Piggyback ports are introduced as a technique to exploit spatial locality in simultaneous translation requests, allowing accesses to the same virtual memory page to combine their requests at the TLB access port. Pretranslation is introduced as a technique for attaching translations to base register values, making it possible to reuse a single translation many times.

We perform extensive simulation-based studies to evaluate our designs. We vary key system parameters, such as processor model, page size, and number of architected registers, to see what effects these changes have on the relative merits of each approach. A number of designs show particular promise. Multi-level TLBs with as few as eight entries in the upper-level TLB nearly achieve the performance of a TLB with unlimited bandwidth. Piggyback ports combined with a lesser-ported TLB structure, e.g., an interleaved or multi-ported TLB, also perform well. Pretranslation over a single-ported TLB performs almost as well as a same-sized multi-level TLB with the added benefit of decreased access latency for physically indexed caches.
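
The piggyback-port idea can be illustrated with a small behavioral sketch. It is not the paper's hardware design or simulator: the function name `translate_cycle`, the dictionary-based TLB, the 4 KiB page size, and the in-order arbitration policy are all assumptions made for illustration. The sketch models one cycle in which several translation requests arrive at once; each distinct virtual page consumes a real port, while further requests to a page already being looked up share that lookup.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages; the paper varies page size


def translate_cycle(requests, tlb, ports=1, page_shift=PAGE_SHIFT):
    """Service one cycle of simultaneous translation requests.

    requests: virtual addresses issued in the same cycle.
    tlb: dict mapping virtual page number -> physical frame number
         (a hypothetical stand-in for the TLB array).
    Returns (done, deferred). `done` maps request index -> physical
    address, or None on a TLB miss. `deferred` lists the indices of
    requests that found no free port and must retry in a later cycle.
    """
    offset_mask = (1 << page_shift) - 1
    port_vpns = []  # virtual page numbers holding a real port this cycle
    done = {}
    deferred = []
    for i, vaddr in enumerate(requests):
        vpn = vaddr >> page_shift
        if vpn not in port_vpns:
            if len(port_vpns) == ports:
                deferred.append(i)  # all ports busy with other pages
                continue
            port_vpns.append(vpn)   # this request claims a real port
        # Requests whose page already holds a port piggyback on it.
        pfn = tlb.get(vpn)
        if pfn is None:
            done[i] = None
        else:
            done[i] = (pfn << page_shift) | (vaddr & offset_mask)
    return done, deferred


tlb = {0x10: 0x80, 0x11: 0x90}
requests = [0x10004, 0x10FF8, 0x11010, 0x10ABC]
done, deferred = translate_cycle(requests, tlb, ports=1)
```

With a single port, the three requests to page `0x10` complete in one cycle and only the request to page `0x11` is deferred, which is the spatial locality in simultaneous requests that piggybacking is meant to exploit.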
