TLC: Transmission Line Caches

It is widely accepted that the disproportionate scaling of transistor and conventional on-chip interconnect performance presents a major barrier to future high-performance systems. Previous research has focused on wire-centric designs that use parallelism, locality, and on-chip wiring bandwidth to compensate for long wire latency. An alternative approach to this problem is to exploit newly emerging on-chip transmission line technology to reduce communication latency. Compared to conventional RC wires, transmission lines can reduce delay by up to a factor of 30 for global wires, while eliminating the need for repeaters. However, this latency reduction comes at the cost of a comparable reduction in bandwidth.

In this paper, we investigate using transmission lines to access large level-2 on-chip caches. We propose a family of Transmission Line Cache (TLC) designs that represent different points in the latency/bandwidth spectrum. Compared to the recently proposed Dynamic Non-Uniform Cache Architecture (DNUCA) design, the base TLC design reduces the required cache area by 18% and reduces the interconnection network's dynamic power consumption by an average of 61%. The optimized TLC designs attain similar performance using fewer transmission lines but with some additional complexity. Results from full-system simulation show that TLC provides more consistent performance than the DNUCA design across a wide variety of workloads. TLC caches are logically simpler than DNUCA designs, but require greater circuit and manufacturing complexity.
