Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors

Automated code generation and performance tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer-factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE Model-Evaluation. Our Verilog-AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations, where the same model is evaluated repeatedly for every device in the circuit. The compiler uses architecture-specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance-tuner to identify the best configuration for each architecture. We demonstrate speedups of 3-182× for a Xilinx Virtex-5 LX330T, 1.3-33× for an IBM Cell, and 3-131× for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models.
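The tuning loop described above can be illustrated with a minimal sketch. It is not the authors' performance-tuner: the diode equation stands in for a compact device model, and the names `diode_current`, `evaluate_all`, and `tune` are hypothetical. The sketch assumes an exhaustive search over (unroll factor, vector length) pairs that times each configuration and keeps the fastest. In plain Python these two parameters only change how the loop is blocked; in generated OpenMP, CUDA, or VLIW code they would select genuinely different implementations.

```python
import itertools
import math
import time


def diode_current(v, i_s=1e-14, v_t=0.02585):
    """Shockley diode equation, used here as a stand-in device model (assumption)."""
    return i_s * (math.exp(v / v_t) - 1.0)


def evaluate_all(voltages, unroll=1, vector_length=4):
    """Evaluate the same model for every device, in blocks of unroll * vector_length.

    The blocking does not change the results, only the loop structure.
    """
    n = len(voltages)
    out = [0.0] * n
    step = unroll * vector_length
    for base in range(0, n, step):
        for i in range(base, min(base + step, n)):
            out[i] = diode_current(voltages[i])
    return out


def tune(voltages, unroll_factors, vector_lengths, repeats=3):
    """Exhaustively time every configuration and return the fastest one found."""
    best_config, best_time = None, float("inf")
    for unroll, vlen in itertools.product(unroll_factors, vector_lengths):
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            evaluate_all(voltages, unroll, vlen)
            # Keep the minimum over repeats to reduce timing noise.
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_config, best_time = (unroll, vlen), elapsed
    return best_config, best_time


if __name__ == "__main__":
    # Hypothetical workload: 10,000 devices with bias voltages between 0 V and 0.7 V.
    volts = [0.1 * (i % 8) for i in range(10_000)]
    config, seconds = tune(volts, [1, 2, 4, 8], [1, 4, 8])
    print(f"best (unroll, vector_length) = {config}, {seconds:.6f} s")
```

The winning configuration depends on the machine and on timing noise, so the printed result will vary from run to run. A real tuner would typically regenerate and recompile the kernel for each configuration rather than pass the parameters at run time.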
