Paper Information - Revisiting Loop Tiling for Datacenters: Live and Let Live

Revisiting Loop Tiling for Datacenters: Live and Let Live

As DNNs gain popularity in modern datacenters, it becomes imperative to revisit compiler optimizations for DNNs in a colocation scenario. Loop tiling turns out to be the most significant of these optimizations, since DNNs typically apply a series of matrix computations iteratively to massive amounts of data. We introduce a reuse-pattern-centric approach to obtaining a peer-aware TSS (Tile Size Selection) model for a matrix-based application A. Our key insight is that the co-running cache behavior of A (once tiled) can be determined from its data reuse patterns together with the cache pressure exerted by its co-running peers, without analyzing the code of those co-runners. Compared with static tiling (which selects a tile size for A statically, without considering its co-running peers), our peer-aware tiling enables compilers to generate either faster peer-aware efficient code for A (by optimizing the performance of A) or faster peer-aware nice code for A (by optimizing the performance of its co-runners). In addition, peer-aware tiling enables library developers to improve the performance of library routines more effectively than static tiling does.
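To make the idea of tile size selection under co-runner cache pressure concrete, the following is a minimal sketch in Python. It is not the paper's TSS model: `pick_tile_size` is a hypothetical heuristic that simply shrinks the tile so that three square tiles fit in whatever fraction of the cache the peers are assumed to leave free, and `peer_share` is an assumed input standing in for measured cache pressure. `tiled_matmul` is an ordinary tiled matrix multiplication parameterized by that tile size.

```python
import math


def pick_tile_size(cache_bytes, peer_share, elem_bytes=8):
    """Hypothetical heuristic, NOT the paper's model.

    Returns the largest T such that three T x T tiles (one each from
    A, B and C) fit in the cache fraction not occupied by co-runners.
    peer_share is an assumed value in [0, 1].
    """
    available = cache_bytes * (1.0 - peer_share)
    tile = math.isqrt(int(available // (3 * elem_bytes)))
    return max(1, tile)


def tiled_matmul(a, b, tile):
    """Compute a @ b for list-of-lists matrices using square tiles."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # The three outer loops walk over tiles; the inner loops stay inside one tile.
    for ii in range(0, n, tile):
        for kk in range(0, k, tile):
            for jj in range(0, m, tile):
                for i in range(ii, min(ii + tile, n)):
                    row_c = c[i]
                    for p in range(kk, min(kk + tile, k)):
                        a_ip = a[i][p]
                        row_b = b[p]
                        for j in range(jj, min(jj + tile, m)):
                            row_c[j] += a_ip * row_b[j]
    return c
```

Under this assumed heuristic, a larger `peer_share` yields a smaller tile, which loosely mirrors the abstract's point that a statically chosen tile size ignores how much cache the co-runners actually leave available. The result of `tiled_matmul` does not depend on the tile size; only the memory access order changes.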
