Paper information - A co-designed virtual machine for instruction-level distributed processing

A current trend in high-performance superscalar processors is toward simpler designs that attempt to strike a balance between clock frequency, instruction-level parallelism, and power consumption. To achieve this goal, the thesis advocates a microarchitecture and design paradigm that rely less on low-level speculation techniques and more on simpler, modular designs with distributed processing at the instruction level, i.e., instruction-level distributed processing (ILDP). This thesis shows that designing a hardware/software co-designed virtual machine (VM) system using an accumulator-oriented instruction set architecture (ISA) and microarchitecture is a good approach for implementing complexity-effective, high-performance out-of-order superscalar machines. The following three key points support this conclusion.

An accumulator-oriented instruction format and microarchitecture fit today's technology constraints better than conventional design approaches: The ILDP ISA format assigns temporary values, which account for most of the register communication, to a small number of accumulators. As a result, the complexity of the register file and associated hardware structures is greatly reduced. Furthermore, the dependence-oriented ILDP ISA format allows simple implementation of a complexity-effective distributed microarchitecture that is tolerant of global communication latencies.

The accumulator-oriented instruction format and microarchitecture result in low-overhead dynamic binary translation (DBT): Because the underlying ILDP hardware provides a form of superscalar out-of-order processing, the dynamic binary translator does not need to perform aggressive optimizations. As a result, the dynamic binary translation overhead is greatly reduced.

The co-designed VM system for ILDP performs similarly to, or better than, conventional superscalar processors having similar pipeline depths while achieving lower complexity in key pipeline structures: This reduction of complexity can be exploited to achieve either a higher clock frequency or lower power consumption, or a combination of the two.

This thesis makes two main contributions. First, the major components of a co-designed VM for ILDP are fully developed: an accumulator-based ISA, a complexity-effective distributed microarchitecture, and a fast and efficient DBT mechanism. Second, performance evaluations and complexity analysis support the key points of the thesis listed above.
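The first key point, that temporary values can be steered to a small set of accumulators, can be illustrated with a minimal sketch. This is not the thesis's translator or its ISA. The `Instr` tuple, the `classify_values` function, and the rule used here (a value is temporary if it is read exactly once before its register is overwritten within the same block) are assumptions chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical three-address instruction: opcode, destination register,
# and a tuple of source registers. Not the ILDP instruction format.
Instr = namedtuple("Instr", ["op", "dest", "srcs"])


def classify_values(block):
    """Label the value produced by each instruction in a straight-line block.

    Assumed rule: a value is 'temporary' when it is read exactly once before
    its register is overwritten later in the same block. Anything else is
    conservatively labeled 'global', because it may still be live on exit.
    Instructions without a destination get the label None.
    """
    labels = []
    for i, ins in enumerate(block):
        if ins.dest is None:
            labels.append(None)
            continue
        uses = 0
        redefined = False
        for later in block[i + 1:]:
            # Reads happen before the write of the same instruction.
            uses += later.srcs.count(ins.dest)
            if later.dest == ins.dest:
                redefined = True
                break
        labels.append("temporary" if uses == 1 and redefined else "global")
    return labels


example = [
    Instr("load", "r1", ("r9",)),       # read once by the add, then overwritten
    Instr("add", "r1", ("r1", "r8")),   # read once by the shl, then overwritten
    Instr("shl", "r3", ("r1",)),        # r3 is never overwritten in this block
    Instr("mov", "r1", ("r8",)),        # last definition of r1 in this block
]

for ins, label in zip(example, classify_values(example)):
    print(ins.op, ins.dest, label)
```

Under this assumed rule, the first two values in the example are labeled temporary and the last two global. In a design along the lines the abstract describes, only the global values would need a slot in the general register file, while the temporaries could be passed through accumulators.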

James E. Smith | Ho-Seop Kim
