A programming system for Xeon Phis with runtime SIMD parallelization

The Intel Xeon Phi offers a promising approach to coprocessing, since it is based on the popular x86 instruction set. However, to fully utilize its potential, applications must be vectorized to exploit the wide SIMD lanes, in addition to achieving effective large-scale shared-memory parallelism. Compared to the SIMT execution model of GPGPUs with CUDA or OpenCL, SIMD parallelism with an SSE-like instruction set imposes many restrictions, and has generally not benefited applications involving branches, irregular accesses, or even reductions. In this paper, we consider the problem of accelerating applications with different communication patterns on Xeon Phis, with an emphasis on effectively using the available SIMD parallelism. We offer an API for both shared-memory and SIMD parallelization, and demonstrate its implementation. We use overloaded function implementations as the mechanism for providing SIMD code, assisted by runtime data reordering and by our methods for effectively managing control flow. An extensive evaluation with six popular applications shows large gains over the SIMD parallelization achieved by the production (ICC) compiler, and our MIMD parallelization even outperforms OpenMP.
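To make the overloaded-function idea concrete, the sketch below shows one plausible shape such an API could take; it is not the paper's actual interface. The names saxpy_elem and saxpy, the dispatch loop, and the use of AVX-512-style 512-bit intrinsics are illustrative assumptions, and the runtime data reordering and control-flow management described in the paper are omitted. The programmer supplies a scalar element function and a SIMD overload of the same name; a driver applies the SIMD overload to full 16-wide chunks and falls back to the scalar overload for the remainder.

```cpp
#include <immintrin.h>
#include <cstddef>

// Scalar overload: processes one element.
static inline float saxpy_elem(float a, float x, float y) {
    return a * x + y;
}

// SIMD overload: processes 16 floats per call (one 512-bit vector)
// with a fused multiply-add.
static inline __m512 saxpy_elem(__m512 a, __m512 x, __m512 y) {
    return _mm512_fmadd_ps(a, x, y);
}

// Hypothetical runtime driver (not the paper's runtime): dispatches full
// 16-wide chunks to the SIMD overload and the tail to the scalar overload.
// Assumes x, y, and out are 64-byte aligned so aligned loads/stores are legal.
void saxpy(float a, const float* x, const float* y, float* out, std::size_t n) {
    const std::size_t width = 16;          // floats per 512-bit vector
    const __m512 va = _mm512_set1_ps(a);
    std::size_t i = 0;
    for (; i + width <= n; i += width) {   // full vectors -> SIMD overload
        __m512 vx = _mm512_load_ps(x + i);
        __m512 vy = _mm512_load_ps(y + i);
        _mm512_store_ps(out + i, saxpy_elem(va, vx, vy));
    }
    for (; i < n; ++i)                     // remainder -> scalar overload
        out[i] = saxpy_elem(a, x[i], y[i]);
}
```

Because both versions share one name, the choice between scalar and vector code can be made by the driver (or a runtime) rather than by the application logic, which is the point of the overloaded-function mechanism.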
