Forma: A framework for safe automatic array reshaping

This article presents Forma, a practical, safe, and automatic data reshaping framework that reorganizes arrays to improve data locality. Forma splits large aggregate data types into smaller ones and replaces each array of a large type with multiple arrays of the smaller types. These new arrays form natural data streams that have smaller memory footprints and better locality, and that are more suitable for hardware stream prefetching. Forma consists of a field-sensitive alias analyzer, a data-type checker, a portable structure reshaping planner, and an array reshaper. An extensive experimental study compares data reshaping strategies along two dimensions: (1) how the data structure is split into smaller ones (maximal partition vs. frequency-based partition vs. affinity-based partition); and (2) how partitioned arrays are linked to preserve program semantics (address-arithmetic-based reshaping vs. pointer-based reshaping). This study exposes important characteristics of array reshaping. First, a practical data reshaper needs not only an interprocedural analysis but also a data-type checker to ensure that array reshaping is safe. Second, the performance improvement due to array reshaping can be dramatic: standard benchmarks can run up to 2.1 times faster after array reshaping. Array reshaping can also degrade performance for certain benchmarks; an extensive micro-architecture-level performance study identifies the causes of this degradation. Third, the seemingly naive maximal partition achieves the best or close-to-best performance in the benchmarks studied, and the article presents an analysis that explains this surprising result. Finally, address-arithmetic-based reshaping always performs better than its pointer-based counterpart.
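
To make the two dimensions in the abstract concrete, the following is a minimal hand-written sketch in C. It is not Forma's implementation or its output. The type and function names (`particle`, `particle_split`, `reshape`, `hot_part`, `cold_part`) and the array length `N` are hypothetical, chosen only for illustration. The sketch shows a maximal partition, where every field of the structure becomes its own array, with address-arithmetic-based linking, where the same index `i` selects the matching element in each split array. A pointer-based variant, where a hot part stores an explicit pointer to its cold part, is included for contrast.

```c
#include <assert.h>
#include <stddef.h>

#define N 4 /* hypothetical array length, for illustration only */

/* Original layout: one array of a large aggregate type. */
struct particle {
    double pos;  /* used in the hot loop */
    double mass; /* used in the hot loop */
    int    id;   /* rarely used */
};

/* Maximal partition: each field becomes its own array (one data stream). */
struct particle_split {
    double pos[N];
    double mass[N];
    int    id[N];
};

/* Copy an array of structures into the split layout. Element i of the
 * original array maps to index i of every split array. */
static void reshape(const struct particle *aos,
                    struct particle_split *soa, size_t n)
{
    assert(n <= N);
    for (size_t i = 0; i < n; i++) {
        soa->pos[i]  = aos[i].pos;
        soa->mass[i] = aos[i].mass;
        soa->id[i]   = aos[i].id;
    }
}

/* Hot loop on the original layout: each iteration also drags the
 * unused id field (and padding) through the cache. */
static double weighted_sum_aos(const struct particle *aos, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += aos[i].pos * aos[i].mass;
    return s;
}

/* Same loop on the split layout: only two dense streams are read. */
static double weighted_sum_soa(const struct particle_split *soa, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += soa->pos[i] * soa->mass[i];
    return s;
}

/* Address-arithmetic linking: the cold field is found by index alone. */
static int id_via_index(const struct particle_split *soa, size_t i)
{
    return soa->id[i];
}

/* Pointer-based linking (contrast): the hot part carries a pointer to
 * the cold part, which costs extra space and an extra load. */
struct cold_part { int id; };
struct hot_part  { double pos; double mass; struct cold_part *cold; };

static int id_via_pointer(const struct hot_part *h)
{
    return h->cold->id;
}
```

In this sketch the index-based variant needs no extra storage to link the split arrays, whereas the pointer-based variant adds one pointer per element and one additional memory access to reach the cold field. That difference is one plausible reading of why the abstract reports address-arithmetic-based reshaping as the better performer, though the article's own analysis should be consulted for the actual explanation.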
