Assessing the Performance of OpenMP Programs on the Intel Xeon Phi

The Intel Xeon Phi has been introduced as a new type of compute accelerator that is capable of executing native x86 applications. It supports programming models that are well established in the HPC community, namely MPI and OpenMP, thus removing the need to refactor codes for accelerator-specific programming paradigms. Because of its native x86 support, the Xeon Phi can also be used stand-alone: codes can be executed directly on the device without any interaction with a host. With its 240 logical cores, the Xeon Phi resembles a big SMP on a chip when compared to a common Xeon-based compute node offering up to 32 logical cores. In this work, we compare a Xeon-based two-socket compute node with a stand-alone Xeon Phi in scalability and performance using OpenMP codes. Considered as individual SMP systems, both come at a very similar price and power envelope, but our results show significant differences in absolute application performance and scalability. We also show to what extent common programming idioms for the Xeon multi-core architecture carry over to the Xeon Phi many-core architecture, and which challenges the shifting ratio of core count to single-core performance poses for the application programmer.
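As a minimal illustration (not taken from the paper) of the kind of OpenMP kernel used in such comparisons, consider the following memory-bound vector triad in C. The same source runs unchanged on a Xeon node and natively on the Xeon Phi; only the thread count and pinning change, via environment variables such as OMP_NUM_THREADS and the Intel runtime's KMP_AFFINITY. The array size, loop body, and bandwidth estimate are illustrative assumptions, not the paper's benchmark.

/* Minimal sketch: a memory-bound vector triad measured with OpenMP.
 * Assumed setup, not the paper's code. Build e.g. with:
 *   icc -O2 -fopenmp triad.c        (host Xeon)
 *   icc -O2 -fopenmp -mmic triad.c  (native Xeon Phi)
 */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 24)  /* illustrative array length; large enough to exceed the caches */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return EXIT_FAILURE;

    /* First-touch initialization in parallel, so pages are placed near
     * the threads that later use them (a NUMA idiom that matters on the
     * two-socket Xeon node, much less on the Phi's uniform memory). */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 1.5 * c[i];  /* triad: 2 loads, 1 store, 2 flops per iteration */
    double t1 = omp_get_wtime();

    printf("threads: %d  time: %.4f s  bandwidth: %.2f GB/s\n",
           omp_get_max_threads(), t1 - t0,
           3.0 * N * sizeof(double) / (t1 - t0) / 1e9);  /* 3 arrays touched */

    free(a); free(b); free(c);
    return EXIT_SUCCESS;
}

A scaling study of the kind described above would then run this binary with, for example, OMP_NUM_THREADS=32 on the host node and OMP_NUM_THREADS=240 KMP_AFFINITY=balanced natively on the Phi, comparing the reported bandwidth at each thread count.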
