Rigel: A Framework for OpenMP Performance Tuning

OpenMP allows developers to harness shared-memory multiprocessing in C and C++ applications, but the performance gained with OpenMP is highly sensitive to the underlying hardware, which makes performance portability across architectures fragile. For example, when mapping a parallel for loop to hardware, OpenMP 4 offers directives for exploiting vector instructions (simd directives) and automatic GPU offloading (target directives), as well as schedule clauses for CPU load balancing. These benefits come at a cost: a developer must understand both the architecture and the application in detail, and must tune iteratively to find the combination of pragma directives that delivers the best performance on the target architecture. In this paper we introduce Rigel, a framework that automates these decisions to arrive at optimized OpenMP-annotated code. Given a code segment with inherent parallelism, our framework uses separate machine learning classification models to predict the anticipated benefit of each optimization. Both the Vector Classification and GPU Offloading Classification models achieve average accuracies of 83%. After classification, code segments are optimized accordingly. Our results show that the GPU offloading optimization leads to an average speedup of 8x over default non-optimized CPU parallel execution (pragma omp parallel for), and that the vector optimization yields an average speedup of 6x over LLVM Clang 4.0 auto-vectorization. Furthermore, the scheduling mechanism selection process achieves an overall accuracy of 90%.
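To make the tuning space concrete, the following is a minimal sketch of the pragma choices the abstract names, applied to one loop. It is not taken from Rigel: the SAXPY kernel, the function names, and the `schedule(dynamic, 64)` chunk size are assumptions chosen only for illustration. Each variant computes the same result; which one runs fastest depends on the hardware and the loop body, and that is the decision the framework aims to automate.

```c
/* Illustrative only: four OpenMP annotations of the same loop.
 * The kernel, function names, and chunk size are hypothetical. */

/* Baseline: default CPU parallel execution. */
void saxpy_parallel(int n, float a, const float *x, float *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Vector variant: threads plus explicit SIMD lanes (simd directive). */
void saxpy_simd(int n, float a, const float *x, float *y) {
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Scheduling variant: dynamic load balancing in chunks of 64 iterations. */
void saxpy_dynamic(int n, float a, const float *x, float *y) {
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Offloading variant: run on an attached device (target directive).
 * The map clauses copy x to the device and y in both directions.
 * Without an available device, execution falls back to the host. */
void saxpy_target(int n, float a, const float *x, float *y) {
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Compiled without OpenMP support, the pragmas are ignored and all four functions run serially with identical results, so the annotations can be swapped without changing program semantics.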
