Paper information - Reducing exception management overhead with software restart markers

Reducing exception management overhead with software restart markers

Modern processors rely on exception handling mechanisms to detect errors and to implement various features such as virtual memory. However, these mechanisms are typically hardware-intensive because of the need to buffer partially-completed instructions to implement precise exceptions and enforce in-order instruction commit, often leading to issues with performance and energy efficiency. The situation is exacerbated in highly parallel machines with large quantities of programmer-visible state, such as VLIW or vector processors. As architects increasingly rely on parallel architectures to achieve higher performance, the problem of exception handling is becoming critical. In this thesis, I present software restart markers as the foundation of an exception handling mechanism for explicitly parallel architectures. With this model, the compiler is responsible for delimiting regions of idempotent code. If an exception occurs, the operating system will resume execution from the beginning of the region. One advantage of this approach is that instruction results can be committed to architectural state in any order within a region, eliminating the need to buffer those values. Enabling out-of-order commit can substantially reduce the exception management overhead found in precise exception implementations, and enable the use of new architectural features that might be prohibitively costly with conventional precise exception implementations. Additionally, software restart markers can be used to reduce context switch overhead in a multiprogrammed environment. This thesis demonstrates the applicability of software restart markers to vector, VLIW, and multithreaded architectures. It also contains an implementation of this exception handling approach that uses the Trimaran compiler infrastructure to target the Scale vector-thread architecture. I show that using software restart markers incurs very little performance overhead for vector-style execution on Scale. Finally, I describe the Scale compiler flow developed as part of this work and discuss how it targets certain features facilitated by the use of software restart markers.

(Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
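
The restart model described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the thesis's implementation: the names `Fault`, `run_region`, and `run_with_restart`, and the `fault_at` parameter, are hypothetical and chosen for illustration. The region reads only from `src` and writes only to `dst`, so it is idempotent; results are committed out of program order without buffering, and a simulated handler restarts the region from its beginning after a fault.

```python
class Fault(Exception):
    """Stands in for a hardware exception such as a page fault."""


def run_region(src, dst, order, fault_at=None):
    """Execute one restart region: dst[i] = 2 * src[i] for every i.

    The region only reads `src` and only writes `dst`, so rerunning it
    from the top always produces the same result (it is idempotent).
    `order` is the order in which results are committed; `fault_at` is
    the commit step at which a simulated exception fires (hypothetical).
    """
    for step, i in enumerate(order):
        if step == fault_at:
            raise Fault(f"fault before committing element {i}")
        dst[i] = 2 * src[i]  # committed immediately, no buffering


def run_with_restart(src, order, fault_at):
    """Mimic the handler: on a fault, restart the region from its start."""
    dst = [None] * len(src)
    restarts = 0
    while True:
        try:
            run_region(src, dst, order, fault_at)
            return dst, restarts
        except Fault:
            restarts += 1
            fault_at = None  # assume the handler resolved the fault


src = [1, 2, 3, 4]
out_of_order = [2, 0, 3, 1]  # commits need not follow program order
result, restarts = run_with_restart(src, out_of_order, fault_at=2)
print(result, restarts)
```

Because the partially written `dst` values from the faulting attempt are simply overwritten on the rerun, the final state matches a fault-free execution. A region that overwrote its own inputs (for example, updating `src` in place) would not be safe to restart this way.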

Mark Hampton
