Improving Locality for Adaptive Irregular Scientific Codes

Irregular scientific codes experience poor cache performance due to their memory access patterns. In this paper, we examine two issues for locality optimizations for irregular computations. First, we find experimentally that locality optimization can improve performance for parallel codes, but that the benefit depends on the parallelization technique used. Second, we show that locality optimization can improve performance even for adaptive codes. We develop a cost model that can be used to calculate an efficient optimization frequency; it can be applied dynamically by instrumenting the program to measure execution time per time-step iteration. Our results are validated through experiments on three representative irregular scientific codes.
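To make the idea of an optimization frequency concrete, the following is a minimal sketch of one possible cost model, not the model developed in the paper. It assumes, purely for illustration, that each locality reordering costs a fixed overhead and that the time per time-step iteration then grows linearly as the adaptive code drifts away from the optimized layout. The names `estimate_growth`, `average_cost`, and `best_interval` are hypothetical.

```python
import math


def estimate_growth(times):
    """Least-squares slope of per-iteration times measured since the last reorder.

    Assumption (not from the paper): time degrades roughly linearly.
    """
    n = len(times)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(times) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(times))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def average_cost(interval, overhead, base_time, growth):
    """Average time per iteration when reordering every `interval` iterations.

    Assumes iteration k after a reorder costs base_time + growth * k.
    """
    total = overhead + interval * base_time + growth * interval * (interval - 1) / 2
    return total / interval


def best_interval(overhead, growth, max_interval=10_000):
    """Reorder interval minimizing average_cost under the linear assumption."""
    if growth <= 0 or overhead <= 0:
        # No measurable degradation (or free reordering): nothing to trade off.
        return max_interval if growth <= 0 else 1
    # Setting the derivative of overhead/n + growth*(n-1)/2 to zero
    # gives the continuous optimum n = sqrt(2 * overhead / growth).
    n = math.sqrt(2 * overhead / growth)
    candidates = {max(1, math.floor(n)), max(1, math.ceil(n))}
    best = min(candidates, key=lambda c: average_cost(c, overhead, 0.0, growth))
    return min(best, max_interval)
```

In a dynamic setting, the per-iteration timings passed to `estimate_growth` could be collected by wrapping each time step with a timer such as `time.perf_counter`, and the interval recomputed after each reordering.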
