Paper Information - Design and Power Performance Evaluation of On-Chip Memory Processor with Arithmetic Accelerators

In this paper, we design an on-chip memory processor with arithmetic accelerators, which are expected to reduce power consumption, and we evaluate the power performance of the processor. We propose implementing vector-type and SIMD-type arithmetic accelerators in the on-chip memory processor. The evaluation results obtained with our simulator indicate that the 4FMAs SIMD-type accelerators perform similarly to the 4FMAs vector-type accelerators on DAXPY and on Livermore kernels 1 and 3. However, the 4FMAs vector-type accelerators outperform the 4FMAs SIMD-type accelerators on matrix multiplication and QCD because of the difference in the element size of the registers. On Livermore kernel 7, the power performance of the 4FMAs SIMD-type accelerators exceeds that of the 4FMAs vector-type accelerators because of register reuse. The 16FMAs vector-type accelerators, however, have an advantage in almost all simulations; the exceptions are the main-memory-bandwidth-intensive benchmarks.
