Cache Performance Optimizations for Parallel Lattice Boltzmann Codes