Apricot: an optimizing compiler and productivity tool for x86-compatible many-core coprocessors

Intel MIC (Many Integrated Core) is the first x86-based coprocessor architecture aimed at accelerating multi-core HPC applications. In the most common usage model, parallel code sections are offloaded to the MIC coprocessor using LEO (Language Extensions for Offload). The developer is responsible for identifying and specifying offloadable code regions, managing data transfers between the CPU and the MIC, and optimizing the application for performance, all of which requires effort and experimentation. In this paper, we present Apricot, an optimizing compiler and productivity tool for x86-compatible many-core coprocessors (such as Intel MIC) that minimizes developer effort by (i) automatically inserting LEO clauses for parallelizable code regions, (ii) selectively offloading some of the code regions to the coprocessor at runtime based on a cost model that we have developed, and (iii) applying a set of optimizations that minimize data communication overhead and improve overall performance. Apricot is intended to assist programmers in porting existing multi-core applications, and in writing new ones, to take advantage of the many-core coprocessor while maximizing overall performance. Experiments with the SpecOMP and NAS Parallel benchmarks show that Apricot can successfully transform OpenMP applications to run on the MIC coprocessor with good performance gains.
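To make steps (i) and (ii) concrete, the following is a minimal sketch of what an OpenMP loop might look like once an offload clause and a runtime guard have been added. It is not Apricot's output or its cost model: the function names `worth_offloading` and `saxpy` and all constants are hypothetical, chosen only for illustration. The `#pragma offload` line follows the LEO syntax of the Intel compiler; other compilers ignore the pragma and run the loop on the host.

```c
#include <stddef.h>

/* Hypothetical cost model (an assumption, not Apricot's): offload only when
   the estimated coprocessor time, including data transfer, beats the
   estimated host time. All three constants are illustrative guesses. */
int worth_offloading(int n, double flops_per_elem, double bytes_per_elem)
{
    const double host_flops_per_sec = 1e10; /* assumed host throughput */
    const double coproc_speedup     = 4.0;  /* assumed compute speedup */
    const double link_bytes_per_sec = 6e9;  /* assumed transfer bandwidth */

    double host_time     = n * flops_per_elem / host_flops_per_sec;
    double transfer_time = n * bytes_per_elem / link_bytes_per_sec;
    double offload_time  = host_time / coproc_speedup + transfer_time;
    return offload_time < host_time;
}

/* y = alpha * x + y. The offload pragma is the kind of clause a tool could
   insert: x is copied in, y is copied in and back out, and the if() clause
   falls back to the host when the cost model predicts no gain. */
void saxpy(int n, float alpha, const float *x, float *y)
{
    /* 2 flops per element; 12 bytes moved (x in, y in, y out). */
    int offload = worth_offloading(n, 2.0, 12.0);

    #pragma offload target(mic) if(offload) in(x : length(n)) inout(y : length(n))
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

Under these assumed constants the guard rejects offloading for `saxpy`, because moving 12 bytes per element costs more than the 2 floating-point operations save; a kernel with far more arithmetic per byte would pass the check. This illustrates why the offload decision is made selectively rather than for every parallel loop.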
