An Empirical Study of Intel Xeon Phi

With at least 50 cores, Intel Xeon Phi is a true many-core architecture. Featuring fairly powerful cores, two cache levels, and very fast interconnections, the Xeon Phi can get a theoretical peak of 1000 GFLOPs and over 240 GB/s. These numbers, as well as its flexibility - it can be used both as a coprocessor or as a stand-alone processor - are very tempting for parallel applications looking for new performance records. In this paper, we present an empirical study of Xeon Phi, stressing its performance limits and relevant performance factors, ultimately aiming to present a simplified view of the machine for regular programmers in search for performance. To do so, we have micro-benchmarked the main hardware components of the processor - the cores, the memory hierarchies, the ring interconnect, and the PCIe connection. We show that, in ideal microbenchmarking conditions, the performance that can be achieved is very close to the theoretical peak, as given in the official programmer's guide. We have also identified and quantified several causes for significant performance penalties. Our findings have been captured in four optimization guidelines, and used to build a simplified programmer's view of Xeon Phi, eventually enable the design and prototyping of applications on a functionality-based model of the architecture.

[1]  Samuel Williams,et al.  Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors , 2007, SIAM Rev..

[2]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[3]  David R. Butenhof Programming with POSIX threads , 1993 .

[4]  Christopher J. Hughes,et al.  Performance and Energy Implications of Many-Core Caches for Throughput Computing , 2010, IEEE Micro.

[5]  David A. Patterson,et al.  Computer Architecture, Fifth Edition: A Quantitative Approach , 2011 .

[6]  Andreas Moshovos,et al.  Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).

[7]  Pradeep Dubey,et al.  Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU , 2010, ISCA.

[8]  Matthias S. Müller,et al.  Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[9]  Bradley C. Kuszmaul,et al.  Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.

[10]  Stephen A. Jarvis,et al.  Exploring SIMD for Molecular Dynamics, Using Intel® Xeon® Processors and Intel® Xeon Phi Coprocessors , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[11]  John E. Stone,et al.  OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems , 2010, Computing in Science & Engineering.

[12]  James Demmel,et al.  the Parallel Computing Landscape , 2022 .

[13]  Thomas Fahringer,et al.  Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design , 2011, Euro-Par.

[14]  Henk Sips,et al.  Parallel and Distributed Systems Report Series Benchmarking Intel Xeon Phi to Guide Kernel Design Information about Parallel and Distributed Systems Report Series: Benchmarking Intel Xeon Phi to Guide Kernel Designwp , 2022 .

[15]  Carl Staelin,et al.  lmbench: Portable Tools for Performance Analysis , 1996, USENIX Annual Technical Conference.

[16]  Sabela Ramos,et al.  Modeling communication in cache-coherent SMP systems: a case-study with Xeon Phi , 2013, HPDC.

[17]  Dirk Schmidl,et al.  Assessing the Performance of OpenMP Programs on the Intel Xeon Phi , 2013, Euro-Par.

[18]  Alan Jay Smith,et al.  Measuring Cache and TLB Performance and Their Effect on Benchmark Runtimes , 1995, IEEE Trans. Computers.

[19]  James Demmel,et al.  Benchmarking GPUs to tune dense linear algebra , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[20]  Carl Staelin,et al.  Memory hierarchy performance measurement of commercial dual-core desktop processors , 2008, J. Syst. Archit..