Paper information - Simulating a LAGS processor to consider variable latency on L1 D-Cache


Variability is one of the important issues in deep-submicron technologies, and assuming constant latencies in the modules of deep-submicron processors can jeopardize their performance. Cache memories exhibit data-dependent latency due to factors such as coupling capacitances and the distance between the port and the requested data. In this paper we present, first, a scheme to detect the completion of read operations on a variable-latency cache memory and, second, an asynchronous approach that exploits this feature to improve processor performance. To that end, we propose a Locally-Asynchronous Globally-Synchronous (LAGS) superscalar microarchitecture in which read operations on a variable-latency L1 data cache are managed through an asynchronous wrapper. We demonstrate its feasibility by running SPEC2000 benchmarks on a 64-bit superscalar processor modeled with an architectural simulator. Simulations show speedups of up to 1.44, averaging 1.22, over a cache design with non-variable latency.
