Reactive tiling

To fully exploit the power of emerging multicore architectures, managing shared resources (i.e., caches) across applications and over time is critical. However, to our knowledge, most prior efforts view this problem from the OS/hardware side, and do not consider whether applications themselves can also participate in this process of managing shared resources. In this paper, we show how an application can react to OS/hardware-based resource management decisions by adapting itself (called reactive application), with the goal of maximizing the utilization of the shared resources allocated to it. Specifically, we present a framework that can generate code for adaptive (reactive) tiling, and propose an execution model in which a reactive application can react to the modulations in its cache space allocations to prevent its performance from degrading. One can expect two potential benefits from this approach. First, matching tile size to available cache capacity dynamically (during execution) improves performance of the target application. Second and equally important, better utilization of shared cache space reduces pressure on other applications (co-runners) that execute concurrently with the target application. Our experimental results show that the proposed scheme improves the performance of applications (over the best static tiles) by 8.4%, on average, when using synthetic cache allocations. Further with dynamic cache allocations determined by the utility-based cache partitioning (a state-of-the-art cache partitioning scheme), it improves performance of a set of eleven HPC applications by 11.3%.

[1]  Per Stenström,et al.  An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[2]  Ling Shao,et al.  DMATiler: Revisiting loop tiling for direct memory access , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[3]  Monica S. Lam,et al.  An affine partitioning algorithm to maximize parallelism and minimize communication , 1999, ICS '99.

[4]  James Demmel,et al.  Performance models for evaluation and automatic tuning of symmetric sparse matrix-vector multiply , 2004, International Conference on Parallel Processing, 2004. ICPP 2004..

[5]  Jichuan Chang,et al.  Cooperative cache partitioning for chip multiprocessors , 2007, ICS '07.

[6]  S. Kim,et al.  Fair cache sharing and partitioning in a chip multiprocessor architecture , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[7]  A. Snavely,et al.  Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000, SIGP.

[8]  Kathryn S. McKinley,et al.  Tile size selection using cache organization and data layout , 1995, PLDI '95.

[9]  Rudolf Eigenmann,et al.  SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance , 2001, WOMPAT.

[10]  Mahmut T. Kandemir,et al.  Improving locality using loop and data transformations in an integrated framework , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[11]  François Irigoin,et al.  Supernode partitioning , 1988, POPL '88.

[12]  J. Ramanujam,et al.  Adaptive parallel tiled code generation and accelerated auto-tuning , 2013, Int. J. High Perform. Comput. Appl..

[13]  Xing Zhou,et al.  Hierarchical overlapped tiling , 2012, CGO '12.

[14]  Corinne Ancourt,et al.  Scanning polyhedra with DO loops , 1991, PPOPP '91.

[15]  Chen Ding,et al.  Defensive loop tiling for shared cache , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[16]  G. Edward Suh,et al.  Dynamic Cache Partitioning for Simultaneous Multithreading Systems , 2004 .

[17]  Vivek Sarkar,et al.  An analytical model for loop tiling and its solution , 2000, 2000 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS (Cat. No.00EX422).

[18]  Michael Wolfe,et al.  More iteration space tiling , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[19]  Hui Wu,et al.  Parallelizing SOR for GPGPUs using alternate loop tiling , 2012, Parallel Comput..

[20]  G. Edward Suh,et al.  Analytical cache models with applications to cache partitioning , 2001, ICS '01.

[21]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[22]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[23]  J. Ramanujam,et al.  Parameterized tiling revisited , 2010, CGO '10.

[24]  Monica S. Lam,et al.  Data and computation transformations for multiprocessors , 1995, PPOPP '95.

[25]  Yan Solihin,et al.  Predicting inter-thread cache contention on a chip multi-processor architecture , 2005, 11th International Symposium on High-Performance Computer Architecture.

[26]  Srihari Makineni,et al.  Communist, Utilitarian, and Capitalist cache policies on CMPs: Caches as a shared resource , 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[27]  Sriram Krishnamoorthy,et al.  Parametric multi-level tiling of imperfectly nested loops , 2009, ICS.

[28]  John Turek,et al.  Optimal Partitioning of Cache Memory , 1992, IEEE Trans. Computers.

[29]  Engin Ipek,et al.  Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[30]  Uday Bondhugula,et al.  A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.

[31]  Mikel Luján,et al.  Adaptive Loop Tiling for a Multi-cluster CMP , 2008, ICA3PP.

[32]  Dimitrios S. Nikolopoulos Dynamic tiling for effective use of shared caches on multithreaded processors , 2004, Int. J. High Perform. Comput. Netw..

[33]  Cédric Bastoul,et al.  Switchable Scheduling for Runtime Adaptation of Optimization , 2014, Euro-Par.