Importance of explicit vectorization for CPU and GPU software performance

Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be applied in addition to these, are discussed less frequently. In this paper, we analyze several optimizations applied to both central processing unit (CPU) and GPU implementations of a particular computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and its GPU equivalent, explicit memory coalescing, are found to be critical to achieving good performance of this algorithm in both environments. The fully optimized CPU version achieves a 9x to 12x speedup over the original CPU version, in addition to the speedup from multi-threading. It is also 2x faster than the fully optimized GPU version, indicating the importance of optimizing CPU implementations.
