Wei Lin | Guoping Long | Jun Yang
[1] Albert Cohen,et al. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions , 2018, ArXiv.
[2] Matthias Korch,et al. Accelerating explicit ODE methods on GPUs by kernel fusion , 2018, Concurr. Comput. Pract. Exp..
[3] Sudhakar Yalamanchili,et al. Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[4] Ken Kennedy,et al. Improving effective bandwidth through compiler enhancement of global cache reuse , 2004, J. Parallel Distributed Comput..
[5] Wenguang Chen,et al. VersaPipe: A Versatile Programming Framework for Pipelined Computing on GPU , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[6] Lidong Zhou,et al. Astra: Exploiting Predictability to Optimize Deep Learning , 2019, ASPLOS.
[7] Shirish Tatikonda,et al. On optimizing machine learning workloads via kernel fusion , 2015, PPoPP.
[8] Jürgen Teich,et al. Automatic Kernel Fusion for Image Processing DSLs , 2018, SCOPES.
[9] V. Sarkar,et al. Collective Loop Fusion for Array Contraction , 1992, LCPC.
[10] Jürgen Teich,et al. From Loop Fusion to Kernel Fusion: A Domain-Specific Approach to Locality Optimization , 2019, 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[11] Mohamed Wahib,et al. Scalable Kernel Fusion for Memory-Bound GPU Applications , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[12] António Branco,et al. Attention Focusing for Neural Machine Translation by Bridging Source and Target Embeddings , 2017, ACL.
[13] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[14] Martín Abadi,et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.
[15] Zheng Zhang,et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems , 2015, ArXiv.
[16] Ken Kennedy,et al. Improving effective bandwidth through compiler enhancement of global cache reuse , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.
[17] Ken Kennedy,et al. Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .
[18] Matthew W. Moskewicz,et al. Boda: A Holistic Approach for Implementing Neural Network Computations , 2017, Conf. Computing Frontiers.
[19] Sergey Ioffe,et al. Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[20] Frédo Durand,et al. Halide , 2017, Commun. ACM.
[21] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[22] Ken Kennedy,et al. A Simple, Fast Dominance Algorithm , 1999 .
[23] Hai Liu,et al. Latte: a language, compiler, and runtime for elegant and efficient deep neural networks , 2016, PLDI.
[24] Haichen Shen,et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning , 2018, OSDI.
[25] Dik Lun Lee,et al. Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba , 2018, KDD.
[26] Sridhar Radhakrishnan,et al. Efficient Kernel Fusion Techniques for Massive Video Data Analysis on GPGPUs , 2015, ArXiv.
[27] David Gregg,et al. Optimal DNN primitive selection with partitioned boolean quadratic programming , 2018, CGO.
[28] Paolo Bientinesi,et al. Program generation for small-scale linear algebra applications , 2018, CGO.