Design and Power Performance Evaluation of On-Chip Memory Processor with Arithmetic Accelerators

In this paper, we design an on-chip memory processor with arithmetic accelerators, which are expected to improve power consumption. In addition, we evaluate the power performance of the processor. We propose implementing vector-type arithmetic accelerators and SIMD-type arithmetic accelerators in the on-chip memory processor. The evaluation results obtained using our simulator indicate that the performance of the 4FMAs SIMD-type accelerators is similar to that of the 4FMAs vector-type accelerators on DAXPY, Livermore kernel 1 and 3. However, the performance of the 4FMAs vector-type accelerator exceeds that of the 4FMAs SIMD-type accelerator with respect to matrix multiplication and QCD because of difference in element size of the registers. On Livermore kernel 7, the power performance of the 4FMAs SIMD-type accelerators exceeds that of the 4FMAs vector-type because of register reuse. However, the 16FMAs vector-type accelerators have an advantage in almost all simulations, excluding main memory bandwidth intensive benchmarks.

[1]  T. Hiramoto,et al.  Mobility Enhancement in Uniaxially Strained (110) Oriented Ultra-Thin Body Single- and Double-Gate MOSFETs with SOI Thickness of Less Than 4 nm , 2007, 2007 IEEE International Electron Devices Meeting.

[2]  S. Maegawa,et al.  Silicon on thin BOX: a new paradigm of the CMOSFET for low-power high-performance application featuring wide-range back-bias control , 2004, IEDM Technical Digest. IEEE International Electron Devices Meeting, 2004..

[3]  F. H. Mcmahon,et al.  The Livermore Fortran Kernels: A Computer Test of the Numerical Performance Range , 1986 .

[4]  Christoforos E. Kozyrakis,et al.  A case for intelligent RAM , 1997, IEEE Micro.

[5]  Peter M. Kogge,et al.  A parallel processing chip with embedded DRAM macros , 1996, IEEE J. Solid State Circuits.

[6]  H. Peter Hofstee,et al.  Power efficient processor architecture and the cell processor , 2005, 11th International Symposium on High-Performance Computer Architecture.

[7]  S. Tam,et al.  A Dual-Core Multi-Threaded Xeon Processor with 16MB L3 Cache , 2006, 2006 IEEE International Solid State Circuits Conference - Digest of Technical Papers.

[8]  R. Kumar,et al.  An Integrated Quad-Core Opteron Processor , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[9]  Henk A. van der Vorst,et al.  Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems , 1992, SIAM J. Sci. Comput..

[10]  David F. Heidel,et al.  An Overview of the BlueGene/L Supercomputer , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[11]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[12]  S.Aoki,et al.  Lattice QCD on Earth Simulator , 2003 .

[13]  Fumio Arakawa,et al.  An embedded processor core for consumer appliances with 2.8GFLOPS and 36M polygons/s FPU , 2004, 2004 IEEE International Solid-State Circuits Conference (IEEE Cat. No.04CH37519).

[14]  C. Lemuet,et al.  The Potential Energy Efficiency of Vector Acceleration , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[15]  Hiroshi Nakamura,et al.  Performance of lattice QCD programs on CP-PACS , 1999, Parallel Computing.

[16]  Hiroshi Nakamura,et al.  Software-controlled on-chip memory for high-performance and low-power computing , 2002, CARN.