Performance and energy evaluation of data prefetching on Intel Xeon Phi

There is an urgent need to evaluate existing parallelism- and data locality-oriented techniques on emerging manycore machines using multithreaded applications. Data prefetching is a well-known latency-hiding technique that comes with various hardware- and software-based implementations in almost all commercial machines. A well-tuned prefetcher can significantly reduce observed data access latencies by bringing soon-to-be-requested data into the cache ahead of time, which in turn improves application execution time. Motivated by this, we present a detailed performance and power characterization of software (compiler-guided) and hardware data prefetching on an Intel Xeon Phi-based system. Our main contributions are (i) an analysis of the interactions between hardware and software prefetching, showing how hardware prefetching can throttle itself in response to software; (ii) results on the power and energy behavior of prefetching, showing how performance and energy gains outweigh the increased power cost of prefetching; and (iii) an evaluation of the use of intrinsic prefetch instructions to prefetch for applications with difficult-to-detect access patterns.
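To make contribution (iii) concrete, the following is a minimal sketch of manually inserted prefetches for an indirect access pattern, the kind that a hardware prefetcher or a compiler may fail to detect. It is not the code evaluated in the paper. It assumes a GCC- or Clang-compatible compiler and uses the `__builtin_prefetch` builtin rather than any Xeon Phi-specific intrinsic; the function name `indirect_sum` and the constant `PREFETCH_DISTANCE` are hypothetical, and a real prefetch distance would have to be tuned per machine and workload.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many iterations ahead to prefetch.
 * The value here is illustrative only, not a recommendation. */
#define PREFETCH_DISTANCE 8

/* Sums values[index[i]] for i in [0, n). The data address depends on
 * the contents of index[], so the access stream has no fixed stride. */
double indirect_sum(const double *values, const int *index, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Request the element needed PREFETCH_DISTANCE iterations
             * from now. Arguments: address, 0 = read access,
             * 3 = high temporal locality. The call is only a hint and
             * never changes the computed result. */
            __builtin_prefetch(&values[index[i + PREFETCH_DISTANCE]], 0, 3);
        }
        sum += values[index[i]];
    }
    return sum;
}
```

The bounds check keeps the look-ahead read of `index[]` inside the array. Because the prefetch is only a hint, the function returns the same sum with or without it; any benefit shows up in timing and energy measurements, not in the output.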
