Paper Information - SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator

Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control, and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase and sparse dataflow parallelism in the Sparse Matrix-Solve phase, and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms. We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X (1.4-23X) across a range of non-linear device models and Matrix-Solve by 2.4X (0.6-13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X (0.2-11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm).
With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (e.g., multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.
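
The three phases named in the abstract can be illustrated with a minimal software sketch. This is not the thesis's implementation: it is a toy DC Newton-Raphson loop for a hypothetical two-node circuit (a current source feeding two resistors and one diode). Every name and component value in it is an assumption chosen for illustration, and a dense 2x2 elimination stands in for the sparse solve. Transient analysis, which would add timestep control to the iteration phase, is not covered.

```python
import math

VT = 0.02585   # thermal voltage in volts (assumed value)
IS = 1e-14     # diode saturation current in amperes (assumed value)


def eval_diode(v):
    """Model-Evaluation: linearize the diode around voltage v.

    Returns (g, ieq) such that i_diode is approximated by g * v + ieq.
    In a full simulator this step runs once per device and the calls
    are independent of each other, hence data-parallel.
    """
    e = math.exp(v / VT)
    i = IS * (e - 1.0)
    g = IS * e / VT
    return g, i - g * v


def solve_2x2(a, b):
    """Matrix-Solve: Gaussian elimination on a 2x2 system.

    A stand-in for the sparse LU factorization a real simulator uses.
    """
    (a00, a01), (a10, a11) = a
    f = a10 / a00
    x1 = (b[1] - f * b[0]) / (a11 - f * a01)
    x0 = (b[0] - a01 * x1) / a00
    return [x0, x1]


def simulate_dc(i_src=1e-3, r1=1e3, r2=10e3, tol=1e-9, max_iter=100):
    """Iteration Control: repeat evaluate and solve until voltages settle.

    Circuit (hypothetical): current source into node 0, r2 from node 0
    to ground, r1 between node 0 and node 1, diode from node 1 to ground.
    """
    g1, g2 = 1.0 / r1, 1.0 / r2
    v = [0.0, 0.6]  # initial guess for the two node voltages
    for iteration in range(1, max_iter + 1):
        gd, ieq = eval_diode(v[1])                    # phase 1
        a = [[g1 + g2, -g1], [-g1, g1 + gd]]          # nodal stamps
        b = [i_src, -ieq]
        v_new = solve_2x2(a, b)                       # phase 2
        delta = max(abs(n - o) for n, o in zip(v_new, v))
        v = v_new
        if delta < tol:                               # phase 3
            return v, iteration
    raise RuntimeError("Newton-Raphson did not converge")


if __name__ == "__main__":
    voltages, iterations = simulate_dc()
    print(voltages, iterations)
```

In this sketch the phases run strictly one after another. The abstract describes parallelizing each phase and composing them in streaming fashion, which a sequential loop like this does not attempt to model.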

Nachiket Kapre
