Adaptive explicitly parallel instruction computing

Current processors are programmed through a fixed interface called the Instruction Set Architecture (ISA). Consequently, a compiler targeting such a processor is forced to choose instructions from the provided instruction set. Often this instruction set is a poor match for the computational requirements of the application program. Within this context, we ask the following questions. (1) Can application performance be improved if the compiler has the freedom to pick the instruction set on a per-application basis? (2) Can we build cost-effective processors that efficiently emulate compiler-determined instruction sets and yet are not application specific? (3) Given that the desired processor capabilities are feasible, can the compiler determine an optimal set of instructions for a given application and generate code that effectively exploits those capabilities? In this thesis, we provide evidence to answer these questions in the affirmative. Through a combination of architectural innovations and novel compilation techniques, this dissertation demonstrates that it is possible to attain significant performance improvements on general-purpose and multimedia applications over comparable fixed-ISA processors. The first half of this dissertation describes this novel class of architectures, focusing on a specific subclass called Adaptive Explicitly Parallel Instruction Computing (AEPIC) architectures, whose definition brings together a collection of ideas intended to enable efficient reconfiguration of processor data paths. While AEPIC processor reconfiguration is effected by the executing program at runtime, the decisions of when and how to reconfigure are made by the compiler and embedded in the application's executable. In the second half, a compilation framework targeting AEPIC processors is proposed.
Several key compilation problems that must be addressed in order to target AEPIC processors, such as partitioning, instruction synthesis, configuration selection, resource allocation, and scheduling, are defined, and efficient techniques for solving them are proposed. Finally, we describe the design of a simulation and performance monitoring framework for AEPIC architectures. A set of programs from the SPEC and MediaBench benchmarks is used to demonstrate how such architectures can improve application performance. Experimental results indicate the significant role that architectural features of AEPIC processors play in masking the overheads of micro-architectural reconfiguration.
