Simulating a LAGS processor to consider variable latency on L1 D-Cache

Variability is one of the major issues in deep-submicron technologies, and assuming constant latencies for the modules of a deep-submicron processor can jeopardize its performance. Cache memories exhibit data-dependent latency due to factors such as coupling capacitances and the distance between the port and the requested data. In this paper we present a scheme to detect the completion of read operations on a variable-latency cache memory, together with an asynchronous approach that exploits this feature to improve processor performance. Specifically, we propose a Locally-Asynchronous Globally-Synchronous (LAGS) superscalar microarchitecture in which read operations on a variable-latency L1 data cache are managed through an asynchronous wrapper. We demonstrate its feasibility by running SPEC2000 benchmarks on a 64-bit superscalar processor modeled with an architectural simulator. Simulations show speedups of up to 1.44, averaging 1.22, over a design whose cache latency does not vary.
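The intuition behind the reported gain can be illustrated with a toy model, which is only a sketch and not the simulator used in the paper. It assumes a fixed-latency cache must charge every read the worst-case latency, while a completion-detecting wrapper charges each read its own latency rounded up to the next clock edge of the synchronous core. The function names, the latency values and the clock period are hypothetical, and the result covers read latency only, not whole-processor speedup.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)

def cycles_fixed(latencies_ps, clock_ps):
    """Fixed-latency baseline: every read pays the worst-case latency."""
    worst = max(latencies_ps)
    return len(latencies_ps) * ceil_div(worst, clock_ps)

def cycles_variable(latencies_ps, clock_ps):
    """Completion detection: each read pays its own latency,
    rounded up to the next clock edge of the synchronous core."""
    return sum(ceil_div(t, clock_ps) for t in latencies_ps)

def read_speedup(latencies_ps, clock_ps):
    """Ratio of cycles spent on reads: fixed latency vs. variable latency."""
    return cycles_fixed(latencies_ps, clock_ps) / cycles_variable(latencies_ps, clock_ps)

# Hypothetical read latencies in picoseconds and a 500 ps clock (assumed values).
reads = [400, 900, 1400, 600]
print(cycles_fixed(reads, 500), cycles_variable(reads, 500), read_speedup(reads, 500))
```

With these assumed values the baseline spends 12 cycles on the four reads and the variable-latency model spends 8. The gap depends entirely on how far typical latencies fall below the worst case, which is the quantity the paper's simulations measure on real benchmarks.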
