Revisiting Loop Tiling for Datacenters: Live and Let Live
Xiaobing Feng | Jingling Xue | Yalin Zhang | Huimin Cui | Jiacheng Zhao
[1] J. Ramanujam,et al. DynTile: Parametric tiled loop generation for parallel execution on multicore processors , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[2] David A. Wood,et al. Reuse-based online models for caches , 2013, SIGMETRICS '13.
[3] Monica S. Lam,et al. A data locality optimizing algorithm , 1991, PLDI '91.
[4] Gang Ren,et al. A comparison of empirical and model-driven optimization , 2003, PLDI '03.
[5] John Turek,et al. Optimal Partitioning of Cache Memory , 1992, IEEE Trans. Computers.
[6] Lingjia Tang,et al. Continuous shape shifting: Enabling loop co-optimization via near-free dynamic code rewriting , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[7] Jack J. Dongarra,et al. Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput.
[8] Jichuan Chang,et al. Cooperative cache partitioning for chip multiprocessors , 2007, ICS '07.
[9] Zhao Zhang,et al. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.
[10] Onur Mutlu,et al. The application slowdown model: Quantifying and controlling the impact of inter-application interference at shared caches and main memory , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[11] Kevin Skadron,et al. Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[12] Jingling Xue,et al. On Tiling as a Loop Transformation , 1997, Parallel Process. Lett.
[13] William Jalby,et al. A Quantitative Algorithm for Data Locality Optimization , 1991, Code Generation.
[14] Wei Wang,et al. ReQoS: reactive static/dynamic compilation for QoS in warehouse scale computers , 2013, ASPLOS '13.
[15] Abid M. Malik. Optimal Tile Size Selection Problem Using Machine Learning , 2012, 2012 11th International Conference on Machine Learning and Applications.
[16] Xiaobing Feng,et al. Predicting Cross-Core Performance Interference on Multicore Processors with Regression Analysis , 2016, IEEE Transactions on Parallel and Distributed Systems.
[17] Victor Eijkhout,et al. Self-Adapting Linear Algebra Algorithms and Software , 2005, Proceedings of the IEEE.
[18] Sanjay V. Rajopadhye,et al. Parameterized tiled loops for free , 2007, PLDI '07.
[19] Luiz André Barroso,et al. The tail at scale , 2013, CACM.
[20] Shengmei Li,et al. Hadoop+: Modeling and Evaluating the Heterogeneity for MapReduce Applications in Heterogeneous Clusters , 2015, ICS.
[21] Onur Mutlu,et al. Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems , 2010, ASPLOS 2010.
[22] P. Sadayappan,et al. Neural Network Assisted Tile Size Selection , 2010.
[23] Quan Chen,et al. DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[24] G. Edward Suh,et al. Dynamic Partitioning of Shared Cache Memory , 2004, The Journal of Supercomputing.
[25] Christoforos E. Kozyrakis,et al. Reconciling high server utilization and sub-millisecond quality-of-service , 2014, EuroSys '14.
[26] Minyi Guo,et al. Enabling loop fusion and tiling for cache performance by fixing fusion-preventing data dependences , 2005, 2005 International Conference on Parallel Processing (ICPP'05).
[27] Pen-Chung Yew,et al. Tile size selection revisited , 2013, ACM Trans. Archit. Code Optim.
[28] Mahmut T. Kandemir,et al. Reactive tiling , 2015, 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[29] Uday Bondhugula,et al. A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.
[30] Hao Luo,et al. HOTL: a higher order theory of locality , 2013, ASPLOS '13.
[31] Kathryn S. McKinley,et al. Tile size selection using cache organization and data layout , 1995, PLDI '95.
[32] Jingling Xue,et al. Loop Tiling for Parallelism , 2000, Kluwer International Series in Engineering and Computer Science.
[33] Lingjia Tang,et al. Compiling for niceness: mitigating contention for QoS in warehouse scale computers , 2012, CGO '12.
[34] Yan Solihin,et al. Predicting inter-thread cache contention on a chip multi-processor architecture , 2005, 11th International Symposium on High-Performance Computer Architecture.
[35] Sriram Krishnamoorthy,et al. Parametric multi-level tiling of imperfectly nested loops , 2009, ICS.
[36] Jingling Xue,et al. Reuse-Driven Tiling for Improving Data Locality , 1998, International Journal of Parallel Programming.
[37] Stijn Eyerman,et al. Per-thread cycle accounting in multicore processors , 2013, TACO.
[38] J. Ramanujam,et al. Parameterized tiling revisited , 2010, CGO '10.
[39] Lingjia Tang,et al. Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers , 2013, ISCA.
[40] Dongrui Fan,et al. Extendable pattern-oriented optimization directives , 2012, International Symposium on Code Generation and Optimization (CGO 2011).
[41] Rajat Garg,et al. TurboTiling: Leveraging Prefetching to Boost Performance of Tiled Codes , 2016, ICS.
[42] Yuefan Deng,et al. New trends in high performance computing , 2001, Parallel Computing.
[43] Keshav Pingali,et al. Think globally, search locally , 2005, ICS '05.
[44] José Duato,et al. An empirical model for predicting cross-core performance interference on multicore processors , 2013, PACT 2013.
[45] Lingjia Tang,et al. Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[46] Trevor Darrell,et al. Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.
[47] Jingling Xue. Communication-Minimal Tiling of Uniform Dependence Loops , 1997, J. Parallel Distributed Comput.
[48] Gang Ren,et al. Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers , 2010, IEEE Micro.
[49] Ken Kennedy,et al. Compiler blockability of numerical algorithms , 1992, Proceedings Supercomputing '92.
[50] Chen Ding,et al. Defensive loop tiling for shared cache , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[51] Yang Yang,et al. Automatic Library Generation for BLAS3 on GPUs , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[52] Tomofumi Yuki,et al. Automatic creation of tile size selection models , 2010, CGO '10.
[53] Chun Chen,et al. Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy , 2005, International Symposium on Code Generation and Optimization.
[54] Sanjay V. Rajopadhye,et al. Multi-level tiling: M for the price of one , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).