Identifying Critical Code Sections in Dataflow Programming Models

Years of practice in optimizing applications show that the major issue is focus: identifying the critical code section whose optimization would yield the highest overall speedup. While this problem is largely solved for sequential applications, it remains a serious hurdle in parallel computing. Moreover, the newest dataflow parallel programming models expose very irregular parallelism, making the identification of the critical code section even harder. To address this issue, we designed an environment that identifies critical code sections in applications. The programmer can use this environment to estimate the potential benefit of optimizing a given section on a specific parallel platform. This is important because the programmer can anticipate the benefit of an optimization and make sure it is worth the effort before undertaking it. Furthermore, we showed that in many applications the choice of the critical code section depends decisively on the configuration of the target machine. For instance, in HP Linpack, optimizing a task that takes 0.49% of the total computation time yields an overall speedup of less than 0.25% on one machine, while on a machine with a different number of cores it yields an overall speedup of more than 24%.
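The machine-dependence described above follows directly from task-graph scheduling: a task that is negligible as a fraction of the total computation can still dominate the critical path once the rest of the work runs in parallel. The sketch below is a minimal, hypothetical illustration of this point, not the environment presented in the paper. It list-schedules a made-up fork-shaped task graph (one small "setup" task feeding 200 independent "work" tasks) on different core counts and reports the overall speedup obtained by halving the setup task. The task names, durations, and the `makespan` scheduler are illustrative assumptions.

```python
import heapq

def makespan(tasks, deps, num_cores):
    """Greedy list-scheduling of a task DAG on num_cores identical cores.

    tasks: dict task_name -> duration
    deps:  dict task_name -> set of predecessor task names
    Returns the simulated finish time of the whole graph.
    """
    remaining = {t: len(deps.get(t, ())) for t in tasks}   # unmet dependencies
    successors = {t: [] for t in tasks}
    for t, preds in deps.items():
        for p in preds:
            successors[p].append(t)

    ready = [t for t, n in remaining.items() if n == 0]    # tasks with no deps
    running = []                                           # heap of (finish_time, task)
    free_cores = num_cores
    clock = 0.0

    while ready or running:
        # Start as many ready tasks as there are free cores.
        while ready and free_cores > 0:
            t = ready.pop()
            heapq.heappush(running, (clock + tasks[t], t))
            free_cores -= 1
        # Advance time to the next task completion and release its successors.
        clock, done = heapq.heappop(running)
        free_cores += 1
        for s in successors[done]:
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)
    return clock

# Hypothetical fork-shaped graph: one small "setup" task feeding 200
# independent "work" tasks (all numbers are illustrative only).
tasks = {"setup": 1.0}
tasks.update({f"work{i}": 1.0 for i in range(200)})
deps = {f"work{i}": {"setup"} for i in range(200)}

# "setup" accounts for roughly 0.5% of the total computation time.
faster = dict(tasks, setup=0.5)          # the candidate optimization: 2x faster setup

for cores in (1, 200):
    base = makespan(tasks, deps, cores)
    opt = makespan(faster, deps, cores)
    print(f"{cores:4d} cores: speedup from halving 'setup' = "
          f"{100.0 * (base / opt - 1.0):.2f}%")
```

With these made-up numbers, halving the setup task gains about 0.25% on one core but over 30% on 200 cores, because on the larger machine the setup task sits alone on the critical path. This mirrors, in spirit, the machine-dependence reported for HP Linpack above.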
