GUSTO: General Architecture Design Utility and Synthesis Tool for Optimization

Matrix computations lie at the heart of many scientific computational algorithms, including signal processing, computer vision and financial computations. Since matrix computation algorithms are expensive computational tasks, hardware implementations of these algorithms require substantial time and effort. There is an increasing demand for a domain-specific tool for matrix computation algorithms that provides fast and highly efficient hardware production. This thesis presents GUSTO, a novel hardware design tool that provides a push-button transition from a high-level specification of matrix computation algorithms to a hardware description language. GUSTO employs a novel top-to-bottom design methodology to generate correct-by-construction and cycle-accurate application-specific architectures. The top-to-bottom design methodology provides simplicity (through the use of a simple tool chain and programming model), flexibility (through the use of different languages, e.g. C/MATLAB, as a high-level specification and different parameterization options), scalability (through the ability to handle complex algorithms) and performance (through the use of our novel trimming optimization using a simulate & eliminate method, providing results that are similar to those of commercial tools). Although matrix computations are inherently parallel, the algorithms and commercial software tools to exploit parallel processing are still in their infancy. Therefore, GUSTO also provides the ability to divide the given matrix computation algorithms into smaller processing elements, providing architectures that are small in area and highly optimized for throughput. These processing elements are then instantiated with hierarchical datapaths in a multi-core fashion.
The different design methods and parameterization options that are provided by GUSTO enable the user to study area and performance tradeoffs over a large number of different architectures and find the optimum architecture for the desired objective. GUSTO provides the ability to prototype hardware systems in minutes rather than days or weeks.
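The abstract names the trimming optimization only as a "simulate & eliminate" method and does not give its details. The following is a minimal Python sketch of one way such a step could look, under the assumption that a general-purpose template with full interconnect is simulated on the scheduled algorithm and any link never exercised is then removed. The unit names, the `trace` format, and the `simulate` and `trim` helpers are all hypothetical and are not taken from GUSTO itself.

```python
from itertools import product

# Hypothetical general-purpose template: every functional unit can feed
# every functional unit (a fully connected interconnect).
UNITS = ["add", "mul", "div", "mem"]
full_interconnect = set(product(UNITS, UNITS))  # (source, destination) links

# Hypothetical scheduled trace of one algorithm: each entry records which
# unit produced an operand and which unit consumed it.
trace = [
    ("mem", "mul"),
    ("mem", "mul"),
    ("mul", "add"),
    ("add", "div"),
    ("div", "mem"),
]


def simulate(trace):
    """Simulate step: collect the set of links the trace actually uses."""
    used = set()
    for src, dst in trace:
        used.add((src, dst))
    return used


def trim(full, used):
    """Eliminate step: keep only the links observed during simulation."""
    return full & used


used_links = simulate(trace)
trimmed_interconnect = trim(full_interconnect, used_links)

print(f"links before trimming: {len(full_interconnect)}")
print(f"links after trimming:  {len(trimmed_interconnect)}")
```

In this toy setting the trimmed design is correct only for the algorithm that was simulated, which is consistent with the abstract's description of the generated architectures as application-specific.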
