Search Space Properties for Mapping Coarse-Grain Pipelined FPGA Applications

This paper describes an automated approach to hardware design space exploration through a collaboration between parallelizing compiler technology and high-level synthesis tools. In previous work, we described a compiler algorithm that optimizes individual loop nests, expressed in C, to derive an efficient FPGA implementation. In this paper, we describe a global optimization strategy that maps multiple loop nests to a coarse-grain pipelined FPGA implementation. The global optimization algorithm automatically transforms the computation to incorporate explicit communication and data reorganization between pipeline stages. It uses metrics to guide design space exploration, so that the search accounts for the impact of communication and balances producer and consumer data rates across pipeline stages. We illustrate the components of the algorithm with a case study of a machine vision kernel.


[281]  Rajesh Gupta Architectural adaptation in AMRM machines , 2000, Proceedings IEEE Computer Society Workshop on VLSI 2000. System Design for a System-on-Chip Era.

[282]  David A. Padua,et al.  Issues in the Optimization of Parallel Programs , 1990, ICPP.

[283]  Rainer Leupers,et al.  Function inlining under code size constraints for embedded processors , 1999, 1999 IEEE/ACM International Conference on Computer-Aided Design. Digest of Technical Papers (Cat. No.99CH37051).

[284]  Alan Jay Smith,et al.  Design and characterization of the Berkeley multimedia workload , 2002, Multimedia Systems.

[285]  David Detlefs,et al.  Inlining of Virtual Methods , 1999, ECOOP.

[286]  Jack J. Dongarra,et al.  Vectorizing compilers: a test suite and results , 1988, Proceedings. SUPERCOMPUTING '88.

[287]  Mahmut T. Kandemir,et al.  Influence of compiler optimizations on system power , 2000, Proceedings 37th Design Automation Conference.

[288]  Aart J. C. Bik,et al.  Automatic Intra-Register Vectorization for the Intel® Architecture , 2002, International Journal of Parallel Programming.

[289]  Indranil Gupta,et al.  On scalable and efficient distributed failure detectors , 2001, PODC '01.

[290]  William Adjie-Winoto,et al.  The design and implementation of an intentional naming system , 2000, OPSR.

[291]  Geoffrey C. Fox,et al.  Parallel Computing Works , 1994 .

[292]  Andrew A. Chien,et al.  Analysis of Dynamic Structures for Efficient Parallel Execution , 1993, LCPC.

[293]  James R. Larus,et al.  Branch prediction for free , 1993, PLDI '93.

[294]  Aart J. C. Bik,et al.  Automatic Detection of Saturation and Clipping Idioms , 2002, LCPC.

[295]  James E. Smith,et al.  A study of branch prediction strategies , 1981, ISCA '98.

[296]  Manfred P. Stadel,et al.  A variation of Knoop, Rüthing, and Steffen's Lazy Code Motion , 1993, SIGP.

[297]  Thomas M. Conte,et al.  Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[298]  Apan Qasem,et al.  Improving Performance with Integrated Program Transformations , 2004 .

[299]  Mahmut T. Kandemir,et al.  The design and use of simplePower: a cycle-accurate energy estimation tool , 2000, Proceedings 37th Design Automation Conference.

[300]  Pierre Jouvelot,et al.  Semantical interprocedural parallelization: an overview of the PIPS project , 1991 .

[301]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[302]  Wen-mei W. Hwu,et al.  Run-time Adaptive Cache Hierarchy Via Reference Analysis , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[303]  Tao Yang,et al.  Program transformation and runtime support for threaded MPI execution on shared-memory machines , 2000, TOPL.

[304]  Joel H. Saltz,et al.  Active Proxy-G: Optimizing the Query Execution Process in the Grid , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[305]  Volker Strumpen,et al.  Portable Checkpointing for Heterogenous Architectures , 1997, International Symposium on Fault-Tolerant Computing.

[306]  Miodrag Potkonjak,et al.  MediaBench: a tool for evaluating and synthesizing multimedia and communications systems , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[307]  Monica S. Lam,et al.  Efficient and exact data dependence analysis , 1991, PLDI '91.

[308]  Dirk Grunwald,et al.  Reducing indirect function call overhead in C++ programs , 1994, POPL '94.

[309]  Jameela Al-Jaroodi,et al.  A comparative study of parallel and distributed Java projects for heterogeneous systems , 2002, Proceedings 16th International Parallel and Distributed Processing Symposium.

[310]  David Grove,et al.  Adaptive online context-sensitive inlining , 2003, International Symposium on Code Generation and Optimization, 2003. CGO 2003..

[311]  M. Schlansker,et al.  On Predicated Execution , 1991 .

[312]  Yunheung Paek,et al.  Parallel Programming with Polaris , 1996, Computer.

[313]  Ken Kennedy,et al.  Optimizing strategies for telescoping languages: procedure strength reduction and procedure vectorization , 2001, ICS '01.

[314]  Jack J. Dongarra,et al.  Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[315]  Liviu Iftode,et al.  Self-routing in pervasive computing environments using smart messages , 2003, Proceedings of the First IEEE International Conference on Pervasive Computing and Communications, 2003. (PerCom 2003)..

[316]  Ruby B. Lee,et al.  Mapping of application software to the multimedia instructions of general-purpose microprocessors , 1997, Electronic Imaging.