Closing the Ninja Performance Gap through Traditional Programming and Compiler Technology

Executive Summary

Current processor trends of integrating more cores with wider SIMD units, along with deeper and more complex memory hierarchies, have made it increasingly challenging to extract performance from applications. Some believe that traditional programming approaches do not apply to these modern processors and that radically new languages must therefore be invented. In this paper, we question this thinking and offer evidence that traditional programming methods, applied to common multi-core processors and upcoming many-core architectures, can deliver significant speedup and close-to-optimal performance for commonly used parallel computing workloads at modest programming effort. We first quantify the extent of the "Ninja gap": the performance gap between naively written, parallelism-unaware (often serial) C/C++ code and best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show an average Ninja gap of 24X (up to 53X) on a recent 6-core Westmere CPU, and that this gap, if left unaddressed, will inevitably grow. We then show how a set of well-known algorithmic changes, coupled with advances in modern compiler technology, can reduce the Ninja gap to an average of just 1.3X. These changes typically require low programming effort, in contrast to the very high effort of producing Ninja code. We also discuss hardware support for programmability that can reduce the impact of these changes and further increase programmer productivity. We show equally encouraging results for the upcoming Intel Many Integrated Core (MIC) architecture, which has more cores and wider SIMD units. We thus demonstrate that the otherwise uncontrolled growth of the Ninja gap can be contained, offering more stable and predictable performance growth across future architectures and strong evidence that radical language changes are not required.
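To illustrate the kind of low-effort, well-known algorithmic change the summary referss to, the sketch below shows a structure-of-arrays (SoA) data layout in plain C. This is a minimal, hypothetical example, not code from the paper: interleaved array-of-structures fields produce strided accesses that block auto-vectorization, whereas a contiguous per-field array gives the unit-stride pattern compilers vectorize well. The type and function names are illustrative.

```c
#include <stddef.h>

/* Array-of-structures layout: the fields of one element are interleaved,
   so a loop over x-coordinates strides through memory, which commonly
   defeats the compiler's auto-vectorizer. */
typedef struct { float x, y, z; } BodyAoS;

/* Structure-of-arrays layout: each field is stored contiguously,
   giving unit-stride accesses that SIMD units handle efficiently. */
typedef struct { float *x, *y, *z; } BodiesSoA;

/* Scale every x-coordinate in place. With the SoA layout the loop body
   touches a contiguous array, and `restrict` promises no aliasing,
   removing two typical vectorization blockers at once. */
void scale_x(float *restrict x, size_t n, float s) {
    for (size_t i = 0; i < n; ++i)
        x[i] *= s;
}
```

With a modern compiler at `-O2`/`-O3`, a loop like this is typically turned into packed SIMD loads, multiplies, and stores without any intrinsics or assembly, which is the spirit of closing the gap through traditional code plus compiler technology.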
