Microprocessor Architecture: From Simple Pipelines to Chip Multiprocessors

This book gives a comprehensive description of the architecture of microprocessors, from simple in-order short-pipeline designs to out-of-order superscalars. It discusses topics such as:

- the policies and mechanisms needed for out-of-order processing, such as register renaming, reservation stations, and reorder buffers
- optimizations for high performance, such as branch predictors, instruction scheduling, and load-store speculations
- design choices and enhancements to tolerate latency in the cache hierarchy of single and multiple processors
- state-of-the-art multithreading and multiprocessing, emphasizing single-chip implementations

Topics are presented as conceptual ideas, with metrics to assess the performance impact, if appropriate, and examples of realization. The emphasis is on how things work at a black-box and algorithmic level. The author also provides sufficient detail at the register transfer level so that readers can appreciate how design features enhance performance as well as complexity.
