Performance evaluation and optimization of random memory access on multicores with high productivity

Memory access latencies have improved slowly in comparison to CPU speeds, with the result that memory accesses now dominate code performance. While architectural enhancements have benefited applications with data locality and sequential access, random memory access remains a cause for concern. Several benchmarks have been proposed to evaluate random memory access performance on multicore architectures. However, the performance evaluation models used by the existing benchmarks do not fully capture the varying types of random access behaviour arising in practical applications. In this paper, we propose a new model for evaluating the performance of random memory access that better captures the random access behaviour demonstrated by applications in practice. We use our model to evaluate the performance of two popular multicore architectures, the Cell and the GPU. We also suggest novel optimizations on these architectures that significantly boost the performance for random accesses in comparison to conventional architectures. Performance improvements on these architectures typically come at the cost of reduced productivity because of the extra programming effort involved. To address this problem, we propose libraries that incorporate these optimizations and provide innovatively designed programming interfaces that applications can use to achieve good performance without loss of productivity.

Yogish Sabharwal | Pramod Bhatotia | Vaibhav Saxena

[1] Jack Dongarra, et al. Introduction to the HPCChallenge Benchmark Suite, 2004.

[2] Philip Heidelberger, et al. HPCC RandomAccess benchmark for next generation supercomputers, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[3] Yogish Sabharwal, et al. Software Routing and Aggregation of Messages to Optimize the Performance of HPCC Randomaccess Benchmark, 2006, ACM/IEEE SC 2006 Conference (SC'06).

[4] Tao Zhang, et al. Prefetching irregular references for software cache on cell, 2008, CGO '08.

[5] Rodney A. Kennedy, et al. Efficient Histogram Algorithms for NVIDIA CUDA Compatible Devices, 2007.

[6] Courtenay T. Vaughan, et al. A Simple Synchronous Distributed-Memory Algorithm for the HPCC RandomAccess Benchmark, 2006, 2006 IEEE International Conference on Cluster Computing.

[7] I. Wald, et al. Ray Tracing on the Cell Processor, 2006, 2006 IEEE Symposium on Interactive Ray Tracing.

[8] J. Hornegger, et al. Fast GPU-Based CT Reconstruction using the Common Unified Device Architecture (CUDA), 2007, 2007 IEEE Nuclear Science Symposium Conference Record.

[9] Jason N. Dale, et al. Cell Broadband Engine Architecture and its first implementation - A performance view, 2007, IBM J. Res. Dev.

[10] Eduard Ayguadé, et al. A Novel Asynchronous Software Cache Implementation for the Cell-BE Processor, 2007, LCPC.