Paper information - Dynamically Trading Frequency for Complexity in a GALS Microprocessor

Microprocessors are traditionally designed to provide "best overall" performance across a wide range of applications and operating environments. Several groups have proposed hardware techniques that save energy by "downsizing" hardware resources that are underutilized by the current application phase. Others have proposed a different energy-saving approach: dividing the processor into domains and dynamically changing the clock frequency and voltage within each domain during phases when the full domain frequency is not required. What has not been studied to date is how to exploit the adaptive nature of these approaches to improve performance rather than to save energy. In this paper, we describe an adaptive globally asynchronous, locally synchronous (GALS) microprocessor with a fixed global voltage and four independently clocked domains. Each domain is streamlined with modest hardware structures for very high clock frequency. Key structures can then be upsized on demand to exploit more distant parallelism, improve branch prediction, or increase cache capacity. Although doing so requires decreasing the associated domain frequency, other domain frequencies are unaffected. Our approach, therefore, is to maximize the throughput of each domain by finding the proper balance between the number of clock periods, and the clock frequency, for each application phase. To achieve this objective, we use novel hardware-based control techniques that accurately and efficiently capture the performance of all possible cache and queue configurations within a single interval, without having to resort to exhaustive online exploration or expensive offline profiling. Measuring across a broad suite of application benchmarks, we find that configuring our adaptive GALS processor just once per application yields 17.6% better performance, on average, than that of the "best overall" fully synchronous design. By adapting automatically to application phases, we can increase this advantage to more than 20%.
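
The trade-off in the abstract, balancing the number of clock periods against the clock frequency, can be illustrated with a small sketch. This is not the paper's hardware control mechanism. It only shows the underlying arithmetic: the time a domain spends on an interval is its cycle count divided by its frequency, so a larger structure pays off only if it removes enough cycles to offset its slower clock. The `Config` type, the configuration names, and every frequency and cycle count below are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One candidate size for a resizable structure (all values hypothetical)."""
    name: str
    freq_ghz: float  # domain clock frequency achievable at this size
    cycles: int      # domain clock periods needed for the interval at this size


def interval_time_ns(cfg: Config) -> float:
    """Execution time for the interval: cycles / GHz gives nanoseconds."""
    return cfg.cycles / cfg.freq_ghz


def best_config(configs):
    """Pick the configuration with the shortest execution time."""
    return min(configs, key=interval_time_ns)


# Assumed numbers: a bigger queue clocks slower but needs fewer cycles.
configs = [
    Config("16-entry queue", 2.0, 1_200_000),
    Config("32-entry queue", 1.6, 1_000_000),
    Config("64-entry queue", 1.25, 700_000),
]

for cfg in configs:
    print(f"{cfg.name}: {interval_time_ns(cfg):.0f} ns")
print("best:", best_config(configs).name)
```

With these assumed numbers the largest queue wins despite its lower frequency, because it saves more cycles than the slower clock costs. A phase with little distant parallelism would show a smaller cycle reduction, and the smallest, fastest configuration would then be the better choice.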
