Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques

Maximizing the level of parallelism in an application requires minimizing the overheads caused by load imbalance and the waiting time caused by memory latency. Compiler optimization is one of the most effective ways to tackle this problem: the compiler can detect the data dependencies in an application and analyze specific sections of code for their parallelization potential. However, these techniques are usually applied at compile time and therefore rely on static analysis, which is insufficient for achieving maximum parallelism and the desired application scalability. One way to address this challenge is to use runtime methods, deferring a certain amount of code analysis to runtime. In this research, we improve the performance of the parallel applications generated by the OP2 compiler by leveraging HPX, a C++ runtime system, to provide runtime optimizations. These optimizations include asynchronous tasking, loop interleaving, dynamic chunk sizing, and data prefetching. The approach was evaluated using an Airfoil application, which showed a 40-50% improvement in parallel performance.
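To make the ideas of asynchronous tasking and loop interleaving more concrete, the following is a minimal sketch that uses only standard C++ (`std::async` and `std::future`) as a stand-in for HPX. It is not the code generated by OP2 and does not use the HPX API; the names `async_chunked_for` and `interleaved_sum` are hypothetical, and the chunk size is simply a runtime argument rather than a value tuned by a runtime system.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper (not part of OP2 or HPX): split the index range [0, n)
// into chunks and run `body` on each chunk as a separate asynchronous task.
// It returns one future per chunk instead of blocking at the end of the loop.
template <typename Body>
std::vector<std::future<void>> async_chunked_for(std::size_t n,
                                                 std::size_t chunk_size,
                                                 Body body) {
    assert(chunk_size > 0);
    std::vector<std::future<void>> tasks;
    for (std::size_t begin = 0; begin < n; begin += chunk_size) {
        const std::size_t end = std::min(begin + chunk_size, n);
        tasks.push_back(std::async(std::launch::async, [=] {
            for (std::size_t i = begin; i < end; ++i) body(i);
        }));
    }
    return tasks;
}

// Two independent loops are launched back to back, so their chunks may
// overlap in time (interleave). The dependent reduction waits only on the
// futures it needs, not on a global barrier after each loop.
double interleaved_sum(std::size_t n, std::size_t chunk_size) {
    std::vector<double> a(n), b(n);

    auto loop_a = async_chunked_for(n, chunk_size,
        [&a](std::size_t i) { a[i] = 2.0 * static_cast<double>(i); });
    auto loop_b = async_chunked_for(n, chunk_size,
        [&b](std::size_t i) { b[i] = 1.0; });

    // Synchronize only here, where the results are actually consumed.
    for (auto& t : loop_a) t.get();
    for (auto& t : loop_b) t.get();

    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] + b[i];
    return sum;
}
```

Calling `interleaved_sum(1000, 128)` launches eight chunks per loop and waits for them only before the reduction. Because the chunk size is chosen at call time, it can be varied without recompiling, which is the kind of decision a static compiler pass cannot make.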
