Paper information - C-level programming of parallel coprocessor accelerators

C-level programming of parallel coprocessor accelerators

We believe that FPGA-like parallel coprocessor accelerators can be programmed efficiently at the “C level” of abstraction. To support this claim, we define an abstract architectural model of accelerators that conveys the kind of high-level behavior and performance characteristics that the von Neumann model conveys to programmers of conventional processors. Using the model as a guide, we define a programming language and compilation strategy that (1) do not impose programming style restrictions that are not inherent in the model, (2) do not introduce serious inefficiencies, and (3) are performance portable across implementations of the model.

In this dissertation I describe C-level programming of accelerators broadly and make three particular contributions to the programmability of accelerators.

Enhanced loop flattening is a new method for translating loop nests with arbitrary static control flow into a form that can be efficiently pipelined with conventional algorithms designed for simple loops. This method advances the goal of supporting a wide set of programming styles with reasonable efficiency.

Parallel accelerators have statically managed resources, such as local memories, that vary widely in capacity from one implementation to the next. To get close to peak performance, applications must be tuned to the specific resources available in a given implementation, and empirical auto-tuning is an attractive way to do that. I propose and evaluate a new probabilistic auto-tuning method that elegantly handles situations where many possible configurations of the application fail to work at all because they exceed some architectural resource limit.

For many applications, achieving good performance on parallel accelerators requires deep loop pipelining, which requires dramatically reordering the individual operations in the application. Local dependencies between operations can be respected by compilers relatively easily, but non-local dependencies force implementations to choose among conservatively not reordering operations (which might kill performance), proving that reordering preserves the meaning of the program (which is impossible in the general case), and making unsound transformations (which programmers generally dislike). I propose a mostly sequential operational semantics for C-level streaming languages targeted at parallel accelerators that offers enough flexibility to the implementation to achieve good performance, deviates from conventional program-order semantics in fairly modest and understandable ways, and provides tools with which the programmer can control the reordering performed by the implementation.

These innovations are evaluated in the context of Macah, a new C-like language developed in the Mosaic group at the University of Washington. For validation we use a number of compute-intensive benchmarks developed by members of the Mosaic group and other contributors.
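
To make the loop flattening idea in the abstract concrete, the following is a minimal sketch of plain (not enhanced) loop flattening in ordinary C. It is an illustration only, not the dissertation's algorithm and not Macah code; the function names sum_nested and sum_flat and the triangular loop nest are hypothetical examples chosen for this sketch. The nested version has an inner loop whose trip count depends on the outer index. The flattened version folds the nest's control flow into explicit index updates inside a single loop, which is the shape that pipelining algorithms designed for simple loops expect.

```c
/* Original form: a two-level nest whose inner trip count depends on i. */
long sum_nested(int n) {
    long acc = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j <= i; j++) {
            acc += (long)i * j;
        }
    }
    return acc;
}

/* Flattened form (hypothetical illustration): one loop, with the nest's
 * control folded into conditional updates of the two index variables. */
long sum_flat(int n) {
    long acc = 0;
    int i = 0, j = 0;
    while (i < n) {
        acc += (long)i * j;   /* same body as the inner loop above */
        if (j < i) {
            j++;              /* stay in the current "inner" iteration space */
        } else {
            i++;              /* advance the "outer" index */
            j = 0;            /* and restart the "inner" index */
        }
    }
    return acc;
}
```

Both functions visit the same (i, j) pairs in the same order, so they return the same value for any n. The flattened loop has a single back edge, at the cost of a conditional index update in every iteration.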

Carl Ebeling | Benjamin Ylvisaker
