Guoping Long | Lansong Diao | Wei Lin | Pengzhan Zhao | Feiwen Zhu | Wenyi Zhao | Jun Yang | Kai Zhu | Zhen Zheng
[1] Sriram Krishnamoorthy, et al. A Code Generator for High-Performance Tensor Contractions on GPUs, 2019, 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[2] Shoaib Kamil, et al. Tiramisu: A Polyhedral Compiler with A Scheduling Language for Targeting High Performance Systems, 2018.
[3] Timothy J. Harvey, et al. A Simple, Fast Dominance Algorithm, 1999.
[4] Marco D. Santambrogio, et al. DAG-based Scheduling with Resource Sharing for Multi-task Applications in a Polyglot GPU Runtime, 2021, 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[5] Mohamed Wahib, et al. Scalable Kernel Fusion for Memory-Bound GPU Applications, 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[6] Sridhar Radhakrishnan, et al. Efficient Kernel Fusion Techniques for Massive Video Data Analysis on GPGPUs, 2015, ArXiv.
[7] Hai Liu, et al. Latte: a language, compiler, and runtime for elegant and efficient deep neural networks, 2016, PLDI.
[8] Phil Blunsom, et al. Optimizing Performance of Recurrent Neural Networks on GPUs, 2016, ArXiv.
[9] Bertrand A. Maher, et al. Glow: Graph Lowering Compiler Techniques for Neural Networks, 2018, ArXiv.
[10] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[11] Ankur Bapna, et al. Faster Transformer Decoding: N-gram Masked Self-Attention, 2020, ArXiv.
[12] Shirish Tatikonda, et al. On optimizing machine learning workloads via kernel fusion, 2015, PPoPP.
[13] Cody Hao Yu, et al. Ansor: Generating High-Performance Tensor Programs for Deep Learning, 2020, OSDI.
[14] Alexander Aiken, et al. TASO: optimizing deep learning computation with automatic generation of graph substitutions, 2019, SOSP.
[15] Lidong Zhou, et al. Astra: Exploiting Predictability to Optimize Deep Learning, 2019, ASPLOS.
[16] Yun Liang, et al. An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[17] Ken Kennedy, et al. Improving effective bandwidth through compiler enhancement of global cache reuse, 2001, Proceedings 15th International Parallel and Distributed Processing Symposium (IPDPS 2001).
[18] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Alec Radford, et al. Proximal Policy Optimization Algorithms, 2017, ArXiv.
[20] Thierry Moreau, et al. Automatic generation of high-performance quantized machine learning kernels, 2020, CGO.
[21] Yao Zhang, et al. A quantitative performance analysis model for GPU architectures, 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[22] Martín Abadi, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, 2016, ArXiv.
[23] Marco Maggioni, et al. Dissecting the NVidia Turing T4 GPU via Microbenchmarking, 2019, ArXiv.
[24] Wenguang Chen, et al. VersaPipe: A Versatile Programming Framework for Pipelined Computing on GPU, 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[25] Ao Li, et al. Automatic Horizontal Fusion for GPU Kernels, 2020, 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[26] Haichen Shen, et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, 2018, OSDI.
[27] Albert Cohen, et al. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions, 2018, ArXiv.
[28] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[29] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[30] Rui Mao, et al. A Performance Model for GPU Architectures that Considers On-Chip Resources: Application to Medical Image Registration, 2019, IEEE Transactions on Parallel and Distributed Systems.
[31] Jürgen Teich, et al. From Loop Fusion to Kernel Fusion: A Domain-Specific Approach to Locality Optimization, 2019, 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[32] Zheng Zhang, et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, 2015, ArXiv.
[33] David Furcy, et al. Limited Discrepancy Beam Search, 2005, IJCAI.
[34] Dik Lun Lee, et al. Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba, 2018, KDD.
[35] Ken Kennedy, et al. Optimizing Compilers for Modern Architectures: A Dependence-based Approach, 2001.
[36] V. Sarkar, et al. Collective Loop Fusion for Array Contraction, 1992, LCPC.
[37] Jürgen Teich, et al. Automatic Kernel Fusion for Image Processing DSLs, 2018, SCOPES.
[38] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[39] John W. Merrill, et al. Automatic Speech Recognition, 2005.
[40] Ken Kennedy, et al. A Simple, Fast Dominance Algorithm, 1999.
[41] Marco Maggioni, et al. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking, 2018, ArXiv.
[42] Xiang Bai, et al. An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[43] Chang Zhou, et al. Deep Interest Evolution Network for Click-Through Rate Prediction, 2018, AAAI.
[44] Mattan Erez, et al. DeLTA: GPU Performance Model for Deep Learning Applications with In-Depth Memory System Traffic Analysis, 2019, 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[45] Paolo Bientinesi, et al. Program generation for small-scale linear algebra applications, 2018, CGO.
[46] Sudhakar Yalamanchili, et al. Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation, 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[47] Matei Zaharia, et al. Optimizing DNN Computation with Relaxed Graph Substitutions, 2019, MLSys.
[48] Erich Elsen, et al. Persistent RNNs: Stashing Recurrent Weights On-Chip, 2016, ICML.
[49] Yanqi Zhou, et al. A Learned Performance Model for the Tensor Processing Unit, 2020, ArXiv.
[50] Zhiqiang Xie, et al. Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks, 2020, OSDI.
[51] David Gregg, et al. Optimal DNN primitive selection with partitioned boolean quadratic programming, 2018, CGO.
[52] Matthias Korch, et al. Accelerating explicit ODE methods on GPUs by kernel fusion, 2018, Concurr. Comput. Pract. Exp.
[53] Matthew W. Moskewicz, et al. Boda: A Holistic Approach for Implementing Neural Network Computations, 2017, Conf. Computing Frontiers.
[54] Mingyu Chen, et al. Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning, 2017, PPoPP.