Performance of OSCAR Multigrain Parallelizing Compiler on SMP Servers

This paper describes the performance of the OSCAR multigrain parallelizing compiler on various SMP servers, such as IBM pSeries 690, Sun Fire V880, Sun Ultra 80, NEC TX7/i6010 and SGI Altix 3700. In addition to loop parallelism, the OSCAR compiler hierarchically exploits coarse grain task parallelism among loops, subroutines and basic blocks, and near fine grain parallelism among statements inside a basic block. It also performs global cache optimization across different loops, or coarse grain tasks, using a data localization technique with inter-array padding to reduce memory access overhead. The current performance of the OSCAR compiler is evaluated on the above SMP servers. For example, the OSCAR compiler, which generates OpenMP parallelized programs from ordinary sequential Fortran programs, gives a 5.7 times speedup on average over seven SPEC CFP95 programs (tomcatv, swim, su2cor, hydro2d, mgrid, applu and turb3d) compared with IBM XL Fortran compiler 8.1 on a 24-processor IBM pSeries 690 SMP server. It also gives a 2.6 times speedup compared with Intel Fortran Itanium Compiler 7.1 on a 16-processor SGI Altix 3700 Itanium 2 server, a 1.7 times speedup compared with NEC Fortran Itanium Compiler 3.4 on an 8-processor NEC TX7/i6010 Itanium 2 server, a 2.5 times speedup compared with Sun Forte 7.0 on a 4-processor Sun Ultra 80 UltraSPARC II desktop workstation, and a 2.1 times speedup compared with Sun Forte compiler 7.1 on an 8-processor Sun Fire V880 UltraSPARC III Cu server.
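The abstract does not say how the padding sizes are chosen, so the following is only a minimal sketch of the general idea behind inter-array padding, not the OSCAR compiler's algorithm. It assumes a hypothetical direct-mapped cache of 4 MiB with 64-byte lines and three arrays that are each exactly one cache size long; the names `cache_slot`, `array_starts` and `start_slots` are illustrative. Without padding, the arrays start at the same cache slot, so accesses to corresponding elements evict each other. Inserting a small pad after each array shifts the later arrays to different slots.

```python
# Hypothetical cache parameters, chosen only for illustration
# (not taken from the paper or from any of the machines it evaluates).
CACHE_BYTES = 4 * 1024 * 1024  # assumed direct-mapped cache of 4 MiB
LINE_BYTES = 64                # assumed cache line size in bytes


def cache_slot(byte_offset):
    """Line slot that a byte offset maps to in a direct-mapped cache."""
    return (byte_offset % CACHE_BYTES) // LINE_BYTES


def array_starts(num_arrays, array_bytes, pad_bytes):
    """Start offsets of arrays laid out back to back,
    with pad_bytes of unused space inserted after each array."""
    return [k * (array_bytes + pad_bytes) for k in range(num_arrays)]


def start_slots(num_arrays, array_bytes, pad_bytes):
    """Cache slot of the first element of each array."""
    starts = array_starts(num_arrays, array_bytes, pad_bytes)
    return [cache_slot(s) for s in starts]


# Three arrays, each exactly as large as the assumed cache.
unpadded = start_slots(3, CACHE_BYTES, 0)
padded = start_slots(3, CACHE_BYTES, 16 * LINE_BYTES)

print("unpadded start slots:", unpadded)
print("padded start slots:  ", padded)
```

In this sketch the unpadded arrays all begin at slot 0, while a pad of 16 lines moves them to slots 0, 16 and 32. A real compiler would have to pick the pad from the actual cache geometry and from the array regions that each coarse grain task accesses.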
