Revisiting Loop Tiling for Datacenters: Live and Let Live

As DNNs gain popularity in modern datacenters, it becomes imperative to revisit compiler optimizations for DNNs in a colocation scenario. Loop tiling turns out to be the most significant of these optimizations, since DNNs typically apply a series of matrix computations iteratively to massive amounts of data. We introduce a reuse-pattern-centric approach to obtaining a peer-aware TSS (Tile Size Selection) model for a matrix-based application A. Our key insight is that the co-running cache behavior of A (once tiled) can be determined from its data reuse patterns together with the cache pressure exerted by its co-running peers, without needing to analyze the code of its co-runners. Compared with static tiling (which determines a tile size for A statically, without considering its co-running peers), our peer-aware tiling enables compilers to generate either faster peer-aware efficient code for A (by optimizing the performance of A) or faster peer-aware nice code for A (by optimizing the performance of its co-runners). In addition, peer-aware tiling enables library developers to improve the performance of library routines more effectively than static tiling does.
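To make the idea of a tile size that depends on co-runners concrete, the following is a minimal sketch and not the TSS model described above. It tiles a square matrix multiplication with a tunable tile size and picks that size with a deliberately simple, hypothetical heuristic: the largest T for which three T x T blocks fit in whatever cache capacity the peers leave free. The names `tiled_matmul`, `pick_tile_size`, `cache_bytes`, and `peer_pressure_bytes` are assumptions introduced for illustration only.

```python
import math


def tiled_matmul(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) using square tiles."""
    c = [[0.0] * n for _ in range(n)]
    # Outer loops walk over tiles; inner loops stay inside one tile,
    # so each block of a, b, and c is reused while it is cache resident.
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c


def pick_tile_size(cache_bytes, peer_pressure_bytes, elem_bytes=8):
    """Hypothetical heuristic (an assumption, not the paper's model):
    largest T such that three T x T blocks fit in the cache share
    left over after subtracting the peers' cache pressure."""
    available = max(cache_bytes - peer_pressure_bytes, 3 * elem_bytes)
    return max(1, math.isqrt(available // (3 * elem_bytes)))
```

Under this sketch, a 32 KiB cache with no peer pressure gives a tile size of 36, while peers occupying half of that cache shrink it to 26. A static choice would keep using 36 regardless of who is co-running.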
