Data layout optimization techniques for modern and emerging architectures

The pursuit of higher performance is a fundamental driving force of computer science research. Although the semiconductor industry has fulfilled Moore’s Law over the last forty years by doubling transistor density every two years, hardware advances cannot be fully exploited because of the mismatch between the architectural environment and the user program. Program optimization is key to bridging this gap. In this dissertation, instead of restructuring programs’ control flow as in many previous efforts, we apply several new data layout optimization techniques to address optimization challenges on modern and emerging architectures. The developed techniques and their contributions are as follows.

(1) We describe an approach in which a class of computations is modeled in terms of constituent operations whose costs are measured empirically, which allows the overall execution time to be modeled. The performance model with empirically determined cost components is used to perform data layout optimization in the context of the Tensor Contraction Engine, a compiler for a high-level domain-specific language for expressing computational models in quantum chemistry.

(2) To obtain a highly optimized index permutation library for dynamic data layout optimization, we develop an integrated optimization framework that addresses tiling for the memory hierarchy, effective handling of memory misalignment, use of memory subsystem characteristics, and exploitation of the parallelism provided by the vector instruction sets in current processors. A judicious combination of analytical and empirical approaches is used to determine the most appropriate optimizations.

(3) With increasing numbers of cores, future chip multiprocessors (CMPs) are likely to have a tiled architecture with a portion of the shared L2 cache on each tile and a bank-interleaved distribution of the address space. Although such an organization is effective for avoiding access hot-spots, it can cause a significant number of non-local L2 accesses for many commonly occurring regular data access patterns. We develop a compile-time framework for data locality optimization via data layout transformation. Using a polyhedral model, the program’s localizability is determined by analysis of its index set and array reference functions, followed by non-canonical data layout transformation to reduce non-local accesses for localizable computations.

(4) We leverage software and operating system utilities to identify the locality patterns of data objects and allocate them in caches with different priorities accordingly. This data-object-locality-guided caching strategy targets two weaknesses of LRU replacement: its inability to handle memory-intensive programs with weak locality (such as streaming accesses) effectively, and contention among strong-locality data objects in caches. Addressing both avoids sub-optimal replacement decisions. To this end, we present a system software framework. We first collect object-relative reuse distance histograms and inter-object interference histograms via memory trace sampling. With several low-cost training runs, we are able to determine the locality patterns of data objects. For the actual runs, we categorize data objects into different locality types and partition the cache space among data objects with a heuristic algorithm, in order to reduce cache misses through segregation of contending objects. The object-level cache partitioning framework has been implemented by modifying a Linux kernel.
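To make the locality problem in (3) concrete, the following minimal sketch models a bank-interleaved L2 in which consecutive cache lines are assigned to tiles round-robin, and a one-dimensional array whose elements are block-distributed over the cores. It is a hand-derived illustration, not the polyhedral framework developed in the dissertation; the function names, the round-robin home-bank mapping, and the assumption that the array length is a multiple of the number of tiles times the line size are all hypothetical.

```python
def home_bank(addr, line_elems, num_tiles):
    """Tile holding the L2 line of element address `addr` (assumed round-robin interleaving)."""
    return (addr // line_elems) % num_tiles


def canonical(i, n, line_elems, num_tiles):
    """Canonical layout: logical index i is stored at address i."""
    return i


def strip_mined(i, n, line_elems, num_tiles):
    """Non-canonical layout: place every line of core c's block on tile c.

    Assumes n is a multiple of num_tiles * line_elems.
    """
    chunk = n // num_tiles                    # elements owned by each core
    core, offset = divmod(i, chunk)           # owning core, offset in its block
    line, within = divmod(offset, line_elems)
    # Interleave the cores' lines so that line k of core c lands on bank c.
    return (line * num_tiles + core) * line_elems + within


def local_fraction(layout, n, line_elems, num_tiles):
    """Fraction of elements whose home bank is the tile of the owning core."""
    chunk = n // num_tiles
    local = sum(
        home_bank(layout(i, n, line_elems, num_tiles), line_elems, num_tiles)
        == i // chunk
        for i in range(n)
    )
    return local / n


if __name__ == "__main__":
    n, line_elems, num_tiles = 64, 4, 4
    print("canonical  :", local_fraction(canonical, n, line_elems, num_tiles))
    print("strip-mined:", local_fraction(strip_mined, n, line_elems, num_tiles))
```

Under these assumptions only about one access in `num_tiles` is local with the canonical layout, whereas the permuted layout makes every access local. The permutation is a bijection on the array's addresses, so it changes where data lives without changing the computation.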
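The profiling step in (4) can be illustrated with a small sketch that builds a reuse distance histogram per data object from a memory trace. It is one plausible reading of "object-relative" (distances are counted only over accesses to the same object) and not the dissertation's implementation; the trace format, the function names, and the use of the stack-distance property of a fully associative LRU cache to predict misses are assumptions made for illustration. Inter-object interference and trace sampling are not modeled here.

```python
from collections import Counter, defaultdict


def object_reuse_histograms(trace):
    """Build one reuse distance histogram per data object.

    trace: iterable of (object_name, cache_line) pairs (hypothetical format).
    The distance of an access is the number of distinct lines of the same
    object touched since the previous access to that line. First-time
    accesses are recorded under the key None.
    """
    stacks = defaultdict(list)         # per-object LRU stack, most recent last
    histograms = defaultdict(Counter)
    for obj, line in trace:
        stack = stacks[obj]
        if line in stack:
            pos = stack.index(line)
            distance = len(stack) - 1 - pos   # lines above it in the stack
            del stack[pos]
        else:
            distance = None                   # cold access
        histograms[obj][distance] += 1
        stack.append(line)
    return histograms


def predicted_misses(histogram, capacity_lines):
    """Misses of a fully associative LRU partition of `capacity_lines` lines.

    An access hits exactly when its reuse distance is below the capacity.
    """
    return sum(
        count
        for distance, count in histogram.items()
        if distance is None or distance >= capacity_lines
    )


if __name__ == "__main__":
    # "A" is reused; "S" is a streaming object that is never reused.
    trace = [("A", 0), ("A", 1), ("A", 0), ("S", 0),
             ("A", 1), ("S", 1), ("S", 2)]
    for name, hist in object_reuse_histograms(trace).items():
        print(name, dict(hist), predicted_misses(hist, 2))
```

A histogram dominated by cold or very long distances suggests a weak-locality (for example streaming) object that gains little from extra cache space, while short distances suggest a strong-locality object worth protecting. A partitioning heuristic could compare `predicted_misses` across candidate partition sizes for each object.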
