System Support for Implicitly Parallel Programming

Implicit parallelization involves developing parallel algorithms and applications in environments that provide sequential semantics, such as the C programming language. System tools convert the parallel algorithms into a set of threads partitioned appropriately for a particular parallel machine organization. The resulting parallel programs are easier and faster to develop, debug, and maintain, because the programmer can request a meaningful and well-defined program state at any point of execution. The contribution of this paper is a case study of a video encoding application. We show that error-checking code, code reuse, and variable scoping interfere with parallelization. We suggest that system tools must perform reactive and speculative transformations if they are to reduce this tension between application robustness and parallelization.
