Lattice Boltzmann simulation optimization on leading multicore platforms

We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to a lattice Boltzmann application (LBMHD) that historically has made poor use of scalar microprocessors due to its complex data structures and memory access patterns. We explore one of the broadest sets of multicore architectures in the HPC literature, including the Intel Clovertown, AMD Opteron X2, Sun Niagara2, and STI Cell, as well as the single-core Intel Itanium2. Rather than hand-tuning LBMHD for each system, we develop a code generator that allows us to identify a highly optimized version for each platform while amortizing the human programming effort. Results show that our auto-tuned LBMHD application achieves up to a 14× improvement compared with the original code. Additionally, we present a detailed analysis of each optimization, which reveals surprising hardware bottlenecks and software challenges for future multicore systems and applications.
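The search-based strategy described above can be illustrated with a minimal sketch. This is not the LBMHD code generator: it assumes a toy 3-point averaging sweep in place of the lattice Boltzmann kernel, a single tuning parameter (the unroll factor), and wall-clock timing as the search metric. The names `generate_kernel` and `autotune` are hypothetical and chosen only for this illustration.

```python
import time


def generate_kernel(unroll):
    """Hypothetical code generator: emit and compile the source of a
    3-point averaging sweep whose main loop is unrolled `unroll` times."""
    body = "\n".join(
        f"        dst[i + {k}] = (src[i + {k} - 1] + src[i + {k}] + src[i + {k} + 1]) / 3.0"
        for k in range(unroll)
    )
    source = (
        "def kernel(src):\n"
        "    n = len(src)\n"
        "    dst = list(src)\n"          # boundary points are copied unchanged
        "    i = 1\n"
        f"    while i + {unroll} <= n - 1:\n"   # unrolled main loop
        f"{body}\n"
        f"        i += {unroll}\n"
        "    while i < n - 1:\n"          # remainder loop for leftover points
        "        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0\n"
        "        i += 1\n"
        "    return dst\n"
    )
    namespace = {}
    exec(source, namespace)
    return namespace["kernel"]


def autotune(candidates, data, repeats=3):
    """Time every generated variant on `data` and return the fastest
    (parameter, seconds) pair. Exhaustive search is an assumption here."""
    best_param, best_time = None, float("inf")
    for param in candidates:
        kernel = generate_kernel(param)
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(data)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_param, best_time = param, elapsed
    return best_param, best_time


if __name__ == "__main__":
    grid = [float(i % 7) for i in range(100_000)]
    param, seconds = autotune([1, 2, 4, 8], grid)
    print(f"fastest unroll factor on this machine: {param} ({seconds:.4f} s)")
```

The winning parameter depends on the machine the search runs on, which is the point of the approach: the generator is written once, and the search picks a variant per platform.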
