Paper information - ${\rm SPICE}^2$: Spatial Processors Interconnected for Concurrent Execution for Accelerating the SPICE Circuit Simulator Using an FPGA

${\rm SPICE}^2$: Spatial Processors Interconnected for Concurrent Execution for Accelerating the SPICE Circuit Simulator Using an FPGA

Spatial processing of sparse, irregular, double-precision floating-point computation using a single field-programmable gate array (FPGA) enables up to an order of magnitude speedup (mean 2.8× speedup) over a conventional microprocessor for the SPICE circuit simulator. We develop a parallel, FPGA-based, heterogeneous architecture customized for accelerating the SPICE simulator to deliver this speedup. To properly parallelize the complete simulator, we decompose SPICE into its three constituent phases (model evaluation, sparse matrix solve, and iteration control) and customize a spatial architecture for each phase independently. Our heterogeneous FPGA organization mixes very large instruction word, dataflow and streaming architectures into a cohesive, unified design to match the parallel patterns exposed by our programming framework. This FPGA architecture is able to outperform conventional processors due to a combination of factors, including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and streaming, overlapped processing of the control algorithms. We demonstrate that we can independently accelerate model evaluation by a mean factor of 6.5× (1.4-23×) across a range of nonlinear device models and matrix solve by 2.4× (0.6-13×) across various benchmark matrices while delivering a mean combined speedup of 2.8× (0.2-11×) for the composite design when comparing a Xilinx Virtex-6 LX760 (40 nm) with an Intel Core i7 965 (45 nm). We also estimate mean energy savings of 8.9× (up to 40.9×) when comparing a Xilinx Virtex-6 LX760 with an Intel Core i7 965.
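The three phases named in the abstract can be illustrated with a small software sketch. The code below is not the paper's FPGA design. It is a minimal, sequential Newton-Raphson DC solve for a hypothetical circuit (a 5 V source behind two 500 Ω resistors feeding a diode), with assumed diode parameters and a dense Gaussian elimination standing in for a sparse solver. All function names and component values are illustrative assumptions.

```python
import math

def eval_diode(v, i_sat=1e-14, v_t=0.02585):
    """Model evaluation: linearize an (assumed) ideal diode around voltage v."""
    e = math.exp(v / v_t)
    i = i_sat * (e - 1.0)          # diode current at v
    g = i_sat / v_t * e            # small-signal conductance di/dv
    return g, i - g * v            # conductance and equivalent current source

def solve_dense(a, b):
    """Matrix solve: dense Gaussian elimination with partial pivoting.
    A real SPICE uses a sparse direct solver; this is only a stand-in."""
    n = len(b)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for r in range(k + 1, n):
            f = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= f * a[k][c]
            b[r] -= f * b[k]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = sum(a[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (b[k] - s) / a[k][k]
    return x

def dc_operating_point(v_src=5.0, r1=500.0, r2=500.0, tol=1e-9, max_iters=50):
    """Iteration control: repeat evaluate + solve until node voltages settle."""
    v = [0.7, 0.7]                 # initial guess for node voltages v1, v2
    for it in range(1, max_iters + 1):
        g, i_eq = eval_diode(v[1])
        # Nodal equations; the source and r1 are folded into a Norton equivalent.
        a = [[1 / r1 + 1 / r2, -1 / r2],
             [-1 / r2, 1 / r2 + g]]
        b = [v_src / r1, -i_eq]
        v_new = solve_dense(a, b)
        if max(abs(x - y) for x, y in zip(v_new, v)) < tol:
            return v_new, it
        v = v_new
    raise RuntimeError("Newton-Raphson did not converge")

if __name__ == "__main__":
    voltages, iterations = dc_operating_point()
    print(voltages, iterations)
```

In this sketch the three functions run one after another on a single thread; the abstract describes mapping each phase to its own customized spatial architecture instead.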

Nachiket Kapre | André DeHon

[1] David Bryan,et al. Combinational profiles of sequential benchmark circuits , 1989, IEEE International Symposium on Circuits and Systems.

[2] J. Gilbert,et al. Sparse Partial Pivoting in Time Proportional to Arithmetic Operations , 1986 .

[3] André DeHon,et al. Compact, multilayer layout for butterfly fat-tree , 2000, SPAA '00.

[4] Yasser Y. Hanafy,et al. Massive parallelization of SPICE device model evaluation on GPU-based SIMD architectures , 2008, IFMT '08.

[5] George A. Constantinides,et al. Automated Precision Analysis: A Polynomial Algebraic Approach , 2010, 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines.

[6] L. Lemaitre,et al. Extensions to Verilog-A to support compact device modeling , 2003, Proceedings of the 2003 IEEE International Workshop on Behavioral Modeling and Simulation.

[7] Guy Lemieux,et al. Towards reliable 5Gbps wave-pipelined and 3Gbps surfing interconnect in 65nm FPGAs , 2009, FPGA '09.

[8] Sudhakar Yalamanchili,et al. Interconnection Networks: An Engineering Approach , 2002 .

[9] Teresa H. Y. Meng,et al. Towards program optimization through automated analysis of numerical precision , 2010, CGO '10.

[10] Ralph Wittig,et al. Performance and power of cache-based reconfigurable computing , 2009, ISCA '09.

[11] Florent de Dinechin,et al. When FPGAs are better at floating-point than microprocessors , 2008, FPGA '08.

[12] Nachiket Kapre,et al. SPICE²: A Spatial, Parallel Architecture for Accelerating the SPICE Circuit Simulator , 2011 .

[13] Gi-Joon Nam,et al. ISPD 2009 clock network synthesis contest , 2009, ISPD '09.

[14] Chung-Kuan Cheng,et al. Parallel transistor level circuit simulation using domain decomposition methods , 2009, 2009 Asia and South Pacific Design Automation Conference.

[15] Nachiket Kapre,et al. GraphStep: A System Architecture for Sparse-Graph Algorithms , 2006, 2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.

[16] David A. Patterson,et al. Computer Architecture - A Quantitative Approach, 5th Edition , 1996 .

[17] Eric R. Keiter,et al. The Xyce Parallel Electronic Simulator - An Overview , 2000 .

[18] David M. Lewis. A programmable hardware accelerator for compiled electrical simulation , 1988, 25th ACM/IEEE Design Automation Conference, Proceedings 1988.

[19] David A. Patterson,et al. Computer Architecture: A Quantitative Approach , 1969 .

[20] Nikil Mehta,et al. Time-Multiplexed FPGA Overlay Networks on Chip , 2006 .

[21] David M. Lewis,et al. A compiled-code hardware accelerator for circuit simulation , 1992, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[22] George Ho,et al. PAPI: A Portable Interface to Hardware Performance Counters , 1999 .

[23] Nachiket Kapre,et al. Accelerating SPICE Model-Evaluation using FPGAs , 2009, 2009 17th IEEE Symposium on Field Programmable Custom Computing Machines.

[24] Joseph A. Fisher. The VLIW Machine: A Multiprocessor for Compiling Scientific Code , 1984, Computer.

[25] John Wawrzynek,et al. Design automation for streaming systems , 2005 .

[26] Andrew B. Kahng,et al. Improved algorithms for hypergraph bipartitioning , 2000, ASP-DAC '00.

[27] Ieee Circuits,et al. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems information for authors , 2018, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[28] Albert E. Ruehli,et al. The modified nodal approach to network analysis , 1975 .

[29] Sunil P. Khatri,et al. Fast circuit simulation on graphics processing units , 2009, 2009 Asia and South Pacific Design Automation Conference.

[30] Qiang Wang,et al. Automated field-programmable compute accelerator design using partial evaluation , 1997, Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Cat. No.97TB100186).

[31] Nachiket Kapre,et al. Optimistic Parallelization of Floating-Point Accumulation , 2007, 18th IEEE Symposium on Computer Arithmetic (ARITH '07).

[32] O. Wing,et al. Optimal parallel triangulation of a sparse matrix , 1979 .

[33] Nachiket Kapre,et al. Packet Switched vs. Time Multiplexed FPGA Overlay Networks , 2006, 2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.

[34] A. DeHon,et al. Parallelizing sparse Matrix Solve for SPICE circuit simulation using FPGAs , 2009, 2009 International Conference on Field-Programmable Technology.

[35] L. Dagum,et al. OpenMP: an industry standard API for shared-memory programming , 1998 .

[36] Timothy A. Davis,et al. The University of Florida sparse matrix collection , 2011, TOMS.

[37] Goichi Yokomizo,et al. A parallel and accelerated circuit simulator with precise accuracy , 2002, Proceedings of ASP-DAC/VLSI Design 2002. 7th Asia and South Pacific Design Automation Conference and 15th International Conference on VLSI Design.

[38] David A. Patterson,et al. Computer Architecture - A Quantitative Approach (4. ed.) , 2007 .

[39] Sani R. Nassif,et al. MAPS: Multi-Algorithm Parallel circuit Simulation , 2008, 2008 IEEE/ACM International Conference on Computer-Aided Design.

[40] Nachiket Kapre,et al. Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors , 2009, 2009 International Conference on Field Programmable Logic and Applications.

[41] David E. Culler,et al. Monsoon: an explicit token-store architecture , 1998, ISCA '98.

[42] Richard F. Barrett,et al. Matrix Market: a web resource for test matrix collections , 1996, Quality of Numerical Software.

[43] Ekanathan Palamadai Natarajan,et al. KLU: A High Performance Sparse Linear Solver for Circuit Simulation Problems , 2005 .

[44] Stylianos Perissakis,et al. Stream computations organized for reconfigurable execution , 2006, Microprocess. Microsystems.