Can traditional programming bridge the Ninja performance gap for parallel computing applications?

Current processor trends of integrating more cores with wider SIMD units, along with a deeper and more complex memory hierarchy, have made it increasingly challenging to extract performance from applications. Some believe that traditional approaches to programming do not apply to these modern processors and that radical new languages must therefore be invented. In this paper, we question this thinking and offer evidence that traditional programming methods, applied to common multi-core processors and upcoming many-core architectures, deliver significant speedups and close-to-optimal performance for commonly used parallel computing workloads at modest programming effort. We first quantify the extent of the "Ninja gap", the performance gap between naively written, parallelism-unaware (often serial) C/C++ code and best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show that there is an average Ninja gap of 24X (up to 53X) for a recent 6-core Intel® Core™ i7 X980 Westmere CPU, and that this gap, if left unaddressed, will inevitably increase. We show how a set of well-known algorithmic changes, coupled with advancements in modern compiler technology, can bring the Ninja gap down to an average of just 1.3X. These changes typically require low programming effort compared to the very high effort of producing Ninja code. We also discuss hardware support for programmability that can reduce the impact of these changes and further increase programmer productivity. We show equally encouraging results for the upcoming Intel® Many Integrated Core architecture (Intel® MIC), which has more cores and wider SIMD. We thus demonstrate that the otherwise uncontrolled growth of the Ninja gap can be contained, giving more stable and predictable performance growth over future architectures and offering strong evidence that radical language changes are not required.
