SPICE²: A Spatial, Parallel Architecture for Accelerating the SPICE Circuit Simulator

Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases (Model-Evaluation, Sparse Matrix-Solve, and Iteration Control) and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase and sparse dataflow parallelism in the Sparse Matrix-Solve phase, and we compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms. We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X (1.4–23X) across a range of non-linear device models and Matrix-Solve by 2.4X (0.6–13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X (0.2–11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm).
With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (e.g., multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.
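
The three-phase decomposition named above can be illustrated with a minimal, sequential Python sketch of a DC operating-point solve. This is not the thesis's FPGA design or its framework: the circuit (a source `vs` feeding resistors `r1` and `r2` and a diode to ground), the diode constants `IS` and `VT`, the starting guess, and the helper names `eval_diode`, `solve_dense` and `simulate_dc` are all assumptions made for illustration, and a small dense Gaussian elimination stands in for a sparse solver.

```python
import math

VT = 0.02585  # thermal voltage in volts (assumed value)
IS = 1e-14    # diode saturation current in amperes (assumed value)


def eval_diode(vd):
    """Model-Evaluation: return diode current and small-signal conductance at vd."""
    e = math.exp(vd / VT)
    return IS * (e - 1.0), IS / VT * e


def solve_dense(a, b):
    """Matrix-Solve: Gaussian elimination with partial pivoting (dense stand-in)."""
    n = len(b)
    a = [row[:] for row in a]  # work on copies so the caller's data is untouched
    b = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for r in range(k + 1, n):
            f = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= f * a[k][c]
            b[r] -= f * b[k]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = sum(a[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (b[k] - s) / a[k][k]
    return x


def simulate_dc(vs=5.0, r1=1e3, r2=1e3, tol=1e-9, max_iters=200):
    """Iteration Control: repeat evaluate-then-solve until node voltages settle."""
    v = [0.6, 0.6]  # initial guess for the two node voltages (assumed)
    for it in range(1, max_iters + 1):
        i_d, g_d = eval_diode(v[1])      # phase 1: evaluate the non-linear device
        i_eq = i_d - g_d * v[1]          # current source of the linearized diode
        g = [[1 / r1 + 1 / r2, -1 / r2],
             [-1 / r2, 1 / r2 + g_d]]    # nodal conductance matrix
        rhs = [vs / r1, -i_eq]
        v_new = solve_dense(g, rhs)      # phase 2: solve the linearized system
        converged = max(abs(x - y) for x, y in zip(v_new, v)) < tol
        v = v_new
        if converged:                    # phase 3: convergence check
            return v, it
    raise RuntimeError("Newton-Raphson iteration did not converge")


if __name__ == "__main__":
    voltages, iterations = simulate_dc()
    print(voltages, iterations)
```

In this sketch each phase runs once per iteration on a single device. With many devices, the calls to `eval_diode` are independent of one another, which is the kind of data parallelism the abstract attributes to the Model-Evaluation phase.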
