Program Demultiplexing: Data-flow based Speculative Parallelization of Methods in Sequential Programs

We present program demultiplexing (PD), an execution paradigm that creates concurrency in sequential programs by "demultiplexing" methods (functions or subroutines). Call sites of a demultiplexed method in the program are associated with handlers that allow the method to be separated from the sequential program and executed on an auxiliary processor. The demultiplexed execution of a method (and its handler) is speculative and occurs when the inputs of the method are (speculatively) available, which is typically far in advance of when the method is actually called in the sequential execution. A trigger, composed of predicates that are based on program counters and memory write addresses, launches the speculative execution of the method on another processor. Our implementation of PD is based on a full-system, execution-based chip multiprocessor simulator, with software that generates triggers and handlers from an x86 program binary. We evaluate eight integer benchmarks from the SPEC2000 suite, programs written in C with neither explicit concurrency nor any motivation to create it, and achieve a harmonic mean speedup of 1.8x with our implementation of PD.
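The trigger mechanism described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the implementation evaluated here: the names `Trigger`, `DemuxedMethod`, `read_set`, and `demo` are hypothetical, the "auxiliary processor" is modeled as an ordinary function call on a memory snapshot, handlers are omitted, and a speculative result is simply discarded if any input address is written after the speculative run.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set, Tuple

Memory = Dict[int, int]  # address -> value (toy memory model, an assumption)


@dataclass
class Trigger:
    """Fires once every listed PC was reached and every listed address was written."""
    pcs: Set[int]
    write_addrs: Set[int]
    seen_pcs: Set[int] = field(default_factory=set)
    seen_writes: Set[int] = field(default_factory=set)

    def observe_pc(self, pc: int) -> None:
        if pc in self.pcs:
            self.seen_pcs.add(pc)

    def observe_write(self, addr: int) -> None:
        if addr in self.write_addrs:
            self.seen_writes.add(addr)

    def satisfied(self) -> bool:
        return self.seen_pcs == self.pcs and self.seen_writes == self.write_addrs


@dataclass
class DemuxedMethod:
    method: Callable[[Memory], int]  # the method reads its inputs from memory
    read_set: Set[int]               # addresses the method reads (hypothetical)
    trigger: Trigger
    buffered: Optional[int] = None   # result of the speculative execution
    valid: bool = False              # is the buffered result still usable?
    launched: bool = False           # speculate at most once in this sketch

    def on_pc(self, pc: int, memory: Memory) -> None:
        self.trigger.observe_pc(pc)
        self._maybe_launch(memory)

    def on_write(self, addr: int, memory: Memory) -> None:
        # A write to an input after speculation invalidates the buffered result.
        if self.valid and addr in self.read_set:
            self.valid = False
        self.trigger.observe_write(addr)
        self._maybe_launch(memory)

    def _maybe_launch(self, memory: Memory) -> None:
        if not self.launched and self.trigger.satisfied():
            self.launched = True
            # Stand-in for execution on another processor: run on a snapshot.
            self.buffered = self.method(dict(memory))
            self.valid = True

    def call(self, memory: Memory) -> Tuple[int, bool]:
        """At the real call site: reuse the speculative result or re-execute."""
        if self.valid and self.buffered is not None:
            return self.buffered, True
        return self.method(memory), False


def demo(extra_write: bool) -> Tuple[int, bool]:
    memory: Memory = {0x10: 0, 0x14: 0}
    m = DemuxedMethod(
        method=lambda mem: mem[0x10] + mem[0x14],
        read_set={0x10, 0x14},
        trigger=Trigger(pcs={0x400}, write_addrs={0x10, 0x14}),
    )

    def store(addr: int, value: int) -> None:
        memory[addr] = value
        m.on_write(addr, memory)

    m.on_pc(0x400, memory)   # sequential program passes the trigger PC
    store(0x10, 3)
    store(0x14, 4)           # all predicates hold: speculative run happens here
    if extra_write:
        store(0x10, 10)      # late write to an input: speculation is discarded
    return m.call(memory)    # the call site, reached later in program order


print(demo(False))  # speculative result reused
print(demo(True))   # speculation invalidated, method re-executed
```

In the first run the method's result is already available when the call site is reached; in the second, the late write forces a normal execution, which is the safe fallback any speculative scheme needs.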
