${\rm SPICE}^2$: Spatial Processors Interconnected for Concurrent Execution for Accelerating the SPICE Circuit Simulator Using an FPGA

Spatial processing of sparse, irregular, double-precision floating-point computation on a single field-programmable gate array (FPGA) delivers up to an order of magnitude speedup (mean 2.8×) over a conventional microprocessor for the SPICE circuit simulator. We develop a parallel, heterogeneous, FPGA-based architecture customized for SPICE to deliver this speedup. To parallelize the complete simulator, we decompose SPICE into its three constituent phases (model evaluation, sparse matrix solve, and iteration control) and customize a spatial architecture for each phase independently. Our heterogeneous FPGA organization combines very long instruction word (VLIW), dataflow, and streaming architectures into a cohesive, unified design that matches the parallel patterns exposed by our programming framework. This FPGA architecture outperforms conventional processors through a combination of factors: high utilization of statically scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and streaming, overlapped processing of the control algorithms. We demonstrate that we can independently accelerate model evaluation by a mean factor of 6.5× (1.4-23×) across a range of nonlinear device models and matrix solve by a mean factor of 2.4× (0.6-13×) across various benchmark matrices, while delivering a mean combined speedup of 2.8× (0.2-11×) for the composite design, when comparing a Xilinx Virtex-6 LX760 (40 nm) with an Intel Core i7 965 (45 nm). For the same comparison, we also estimate mean energy savings of 8.9× (up to 40.9×).
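To make the three-phase decomposition concrete, the following is a minimal software sketch of a SPICE-style DC operating-point loop in Python. It is not the FPGA architecture described above. The circuit (a current source, two resistors, and one diode), all component values, the function names, and the small dense Gaussian-elimination routine standing in for a sparse solver are assumptions chosen only for illustration.

```python
import math

I_SAT = 1e-14   # assumed diode saturation current (A)
V_T = 0.02585   # assumed thermal voltage (V)


def eval_diode(v):
    """Phase 1, model evaluation: linearize the diode around voltage v."""
    e = math.exp(v / V_T)
    i = I_SAT * (e - 1.0)      # diode current at v
    g = I_SAT * e / V_T        # small-signal conductance di/dv
    i_eq = i - g * v           # equivalent current source of the linear model
    return g, i_eq


def solve_dense(a, b):
    """Phase 2, matrix solve: Gaussian elimination with partial pivoting.

    A dense stand-in; a real simulator would use a sparse solver here.
    """
    n = len(b)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for r in range(k + 1, n):
            f = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= f * a[k][c]
            b[r] -= f * b[k]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = sum(a[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (b[k] - s) / a[k][k]
    return x


def dc_operating_point(i_src=5e-3, r1=1e3, r2=1e3, tol=1e-9, max_iter=200):
    """Phase 3, iteration control: Newton-Raphson until node voltages settle.

    Hypothetical circuit: i_src into node 1, r1 from node 1 to ground,
    r2 between nodes 1 and 2, diode from node 2 to ground.
    """
    v = [0.0, 0.6]  # initial guess for the two node voltages
    for _ in range(max_iter):
        g, i_eq = eval_diode(v[1])
        a = [[1 / r1 + 1 / r2, -1 / r2],
             [-1 / r2, 1 / r2 + g]]
        b = [i_src, -i_eq]
        v_new = solve_dense(a, b)
        converged = max(abs(x - y) for x, y in zip(v_new, v)) < tol
        v = v_new
        if converged:
            return v
    raise RuntimeError("Newton-Raphson did not converge")


if __name__ == "__main__":
    v1, v2 = dc_operating_point()
    print(f"v1 = {v1:.6f} V, v2 = {v2:.6f} V")
```

Each Newton-Raphson iteration runs the phases in sequence: device models are re-evaluated at the current voltages, the resulting linear system is solved, and the control step decides whether to iterate again. In this sketch the phases differ in structure (independent per-device arithmetic, dependent elimination steps, and a small control loop), which gives a rough sense of why the abstract treats each phase with its own architecture.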
