Evaluating techniques for parallelization tuning in MPI, OmpSs and MPI/OmpSs

Parallel programming is used to partition a computational problem among multiple processing units and to define how they interact (communicate and synchronize) in order to guarantee a correct result. The performance achieved when executing a parallel program on a parallel architecture is usually far from optimal: computation imbalance and excessive interaction among processing units often cause lost cycles, reducing the efficiency of the parallel computation. In this thesis we propose techniques to better exploit the parallelism of parallel applications, with emphasis on techniques that increase asynchronism. In theory, this type of parallelization tuning promises multiple benefits. First, it should mitigate communication and synchronization delays, thus increasing overall performance. Furthermore, parallelization tuning should expose additional parallelism, thereby increasing the scalability of the execution. Finally, increased asynchronism should provide higher tolerance to slower networks and to external noise.

In the first part of this thesis, we study the potential for tuning MPI parallelism. More specifically, we explore automatic techniques to overlap communication and computation. We propose a speculative messaging technique that increases the overlap and requires no changes to the original MPI application. Our technique automatically identifies the application's MPI activity and reinterprets that activity using optimally placed non-blocking MPI requests. We demonstrate that this overlapping technique increases the asynchronism of MPI messages, maximizing the overlap and consequently leading to execution speedup and higher tolerance to bandwidth reduction. However, for realistic scientific workloads, we show that the overlapping potential is significantly limited by the pattern in which each MPI process locally operates on MPI messages.
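As a concrete illustration of the overlap this tuning targets, the following minimal C/MPI sketch (hypothetical, not the thesis's speculative messaging mechanism itself) shows how a blocking receive can be reinterpreted as an early-posted non-blocking request so that independent computation proceeds while the message is in flight; the helper functions `independent_work` and `use_message` are placeholders.

```c
#include <mpi.h>

/* Placeholders for application work (hypothetical helpers). */
void independent_work(void);
void use_message(double *buf, int count);

/* Minimal sketch of communication/computation overlap: post the
 * receive early as a non-blocking request, perform work that does
 * not depend on the incoming data, and wait only when the data is
 * actually needed. */
void receive_with_overlap(double *buf, int count, int src, MPI_Comm comm)
{
    MPI_Request req;

    /* Post the receive as early as possible. */
    MPI_Irecv(buf, count, MPI_DOUBLE, src, 0, comm, &req);

    /* Computation independent of buf overlaps the message transfer. */
    independent_work();

    /* Block only at the first use of the received data. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    use_message(buf, count);
}
```

The benefit of such a transformation depends on how much truly independent work exists between the point where the request can be posted and the point where its data is first consumed, which is exactly the local message-handling pattern that limits the overlapping potential in realistic workloads.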
In the second part of this thesis, we study the potential for tuning hybrid MPI/OmpSs parallelism. We try to gain a better understanding of the parallelism of hybrid MPI/OmpSs applications in order to evaluate how these applications would execute on future machines and to predict the execution bottlenecks that are likely to emerge. We explore how MPI/OmpSs applications could scale on a parallel machine with hundreds of cores per node, and we investigate how this high parallelism within each node would reflect on the network constraints. We especially focus on identifying critical code sections in MPI/OmpSs applications, and we devise a technique that quickly evaluates, for a given MPI/OmpSs application and a selected target machine, which code section should be optimized in order to gain the highest performance benefit.

This thesis also studies techniques to quickly explore the potential OmpSs parallelism inherent in applications. We provide mechanisms to easily evaluate the potential parallelism of any task decomposition, and we describe an iterative trial-and-error approach to search for a task decomposition that exposes sufficient parallelism for a given target machine. Finally, we explore the potential of automating this iterative approach by capturing the programmers' experience in an expert system that can autonomously lead the search process.

Throughout the work on this thesis, we also designed development tools that can be useful to other researchers in the field. The most advanced of these tools is Tareador, a tool that helps port MPI applications to the MPI/OmpSs programming model. Tareador provides a simple interface for proposing a decomposition of the code into OmpSs tasks; it dynamically calculates data dependencies among the annotated tasks and automatically estimates the potential of the resulting OmpSs parallelization. Furthermore, Tareador gives additional hints on how to complete the process of porting the application to OmpSs. Tareador has already proved useful, having been adopted in the parallel programming courses at UPC.
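To make the kind of task decomposition such tools reason about more tangible, here is a hypothetical OmpSs-style fragment (a sketch under assumed syntax, not code from the thesis or from Tareador's interface): two functions are annotated as tasks, and the out/inout clauses declare the data each task writes or updates, which is the information from which data dependencies between task instances can be derived. The function names, the array `a`, and the block size `BS` are placeholders.

```c
#include <stdio.h>

#define N  1024
#define BS 256   /* block size (placeholder value) */

/* Hypothetical OmpSs-style task decomposition sketch. Each annotated
 * function becomes a task; the clauses declare the BS-element region
 * the task writes (out) or updates (inout). */

#pragma omp task out([BS] block)
void init_block(double *block)
{
    for (int j = 0; j < BS; j++)
        block[j] = 1.0;
}

#pragma omp task inout([BS] block)
void scale_block(double *block, double factor)
{
    for (int j = 0; j < BS; j++)
        block[j] *= factor;
}

int main(void)
{
    static double a[N];

    /* Tasks on different blocks are independent and may run in
     * parallel; scale_block on a block must wait for init_block on
     * the same block. */
    for (int i = 0; i < N; i += BS) {
        init_block(&a[i]);
        scale_block(&a[i], 2.0);
    }
    #pragma omp taskwait   /* wait for all outstanding tasks */

    printf("a[0] = %f\n", a[0]);
    return 0;
}
```

Given a decomposition like this one, a tool in the spirit of Tareador can record which task instances touch which data and estimate how much parallelism the decomposition would expose on a given target machine.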
