A memory accelerator with gather functions for bandwidth-bound irregular applications

Compute-intensive processing is easily accelerated by many-core processors such as GPUs. For memory-bandwidth-intensive applications such as sparse matrix-vector multiplication (SpMV), however, the memory bandwidth limitation grows more severe every year. To accelerate such applications, we previously proposed a memory system with additional scatter and gather functions. The preliminary evaluation of that system assumed that the throughput of the memory system was sufficient. In this paper, we propose a memory system that performs scattering and gathering using many narrow memory channels. We evaluate the feasible throughput of the proposed memory system, based on DDR3 DRAM, with a modified DRAMsim2 simulator. In addition, we evaluate the performance of SpMV using our method on the proposed memory system combined with a GPU. We confirm that the proposed memory system achieves good performance that remains stable across variations in matrix shape, while using fewer pins for external memory.
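
To illustrate why SpMV is bound by irregular memory access, the following is a minimal sketch of SpMV with the gather step made explicit. It assumes the matrix is stored in compressed sparse row (CSR) form, which the abstract does not specify; the function name `spmv_csr` and the small example matrix are hypothetical and are not taken from the paper. The gather is done here in software, whereas the paper proposes performing it in the memory system.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form (assumed format)."""
    # Gather step: index-driven, irregular reads of x.
    # This is the access pattern a gather-capable memory system would serve,
    # so that the processor sees one contiguous stream instead.
    gathered = [x[j] for j in col_idx]

    # Streaming step: values and gathered are now read sequentially.
    y = []
    for r in range(len(row_ptr) - 1):
        start, end = row_ptr[r], row_ptr[r + 1]
        y.append(sum(values[k] * gathered[k] for k in range(start, end)))
    return y


# Hypothetical 3x3 example: A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(row_ptr, col_idx, values, x)
print(y)  # [5.0, 6.0, 19.0]
```

Separating the gather from the multiply-accumulate loop shows that only the reads of `x` are irregular; every other array is traversed in order.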
