FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System
Yun Liang | Shuo Wang | Size Zheng | Kaiwen Sheng | Renze Chen
[1] Tianqi Chen,et al. XGBoost: A Scalable Tree Boosting System , 2016, KDD.
[2] Francky Catthoor,et al. Polyhedral parallel code generation for CUDA , 2013, TACO.
[3] Jason Cong,et al. High-Level Synthesis for FPGAs: From Prototyping to Deployment , 2011, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[4] Jonathan Ragan-Kelley,et al. Automatically scheduling halide image processing pipelines , 2016, ACM Trans. Graph..
[5] Mark Sandler,et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[6] Thierry Moreau,et al. Learning to Optimize Tensor Programs , 2018, NeurIPS.
[7] Jason Cong,et al. HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing , 2019, FPGA.
[8] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method , 2012, ArXiv.
[9] Frédo Durand,et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI.
[10] Qinru Qiu,et al. C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs , 2018, FPGA.
[11] Peter Dayan,et al. Q-learning , 1992, Machine Learning.
[12] L. Dagum,et al. OpenMP: an industry standard API for shared-memory programming , 1998 .
[13] Trevor Darrell,et al. Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.
[14] R. C. Whaley,et al. ATLAS (Automatically Tuned Linear Algebra Software) , 2011, Encyclopedia of Parallel Computing.
[15] Shiguang Shan,et al. Shift-Net: Image Inpainting via Deep Feature Rearrangement , 2018, ECCV.
[16] Kurt Keutzer,et al. Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[17] Yutaka Satoh,et al. Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).
[18] Steven G. Johnson,et al. FFTW: an adaptive software architecture for the FFT , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).
[19] Peter Dayan,et al. Technical Note: Q-Learning , 2004, Machine Learning.
[20] Kevin Skadron,et al. Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[21] Frédo Durand,et al. Learning to optimize halide with tree search and random programs , 2019, ACM Trans. Graph..
[22] Jason Cong,et al. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks , 2015, FPGA.
[23] Shengen Yan,et al. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).
[24] Nikos D. Sidiropoulos,et al. Tensors for Data Mining and Data Fusion , 2016, ACM Trans. Intell. Syst. Technol..
[25] Xiang Zhang,et al. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.
[26] James A. Hendler,et al. Semantic Social Network Analysis by Cross-Domain Tensor Factorization , 2017, IEEE Transactions on Computational Social Systems.
[27] Haichen Shen,et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning , 2018, OSDI.
[28] Mary W. Hall,et al. SWIRL: High-performance many-core CPU code generation for deep neural networks , 2019, Int. J. High Perform. Comput. Appl..
[29] Jeff Johnson,et al. Fast Convolutional Nets With fbfft: A GPU Performance Evaluation , 2014, ICLR.
[30] Wei Zhang,et al. FlexCL: A Model of Performance and Power for OpenCL Workloads on FPGAs , 2018, IEEE Transactions on Computers.
[31] Karsten Schwan,et al. Leo: A Profile-Driven Dynamic Optimization Framework for GPU Applications , 2014, TRIOS.
[32] Albert Cohen,et al. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions , 2018, ArXiv.
[33] Yann LeCun,et al. Fast Training of Convolutional Networks through FFTs , 2013, ICLR.
[34] Shane Legg,et al. Human-level control through deep reinforcement learning , 2015, Nature.
[35] Natalia Gimelshein,et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.
[36] John Tran,et al. cuDNN: Efficient Primitives for Deep Learning , 2014, ArXiv.
[37] Liancheng Jia,et al. A coordinated tiling and batching framework for efficient GEMM on GPUs , 2019, PPoPP.
[38] Lawrence D. Jackel,et al. Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.
[39] Dongrui Fan,et al. Enabling coordinated register allocation and thread-level parallelism optimization for GPUs , 2018, MICRO.
[40] Yoon Kim,et al. Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.
[41] Uday Bondhugula,et al. A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.
[42] Xiuhong Li,et al. CuLDA: Solving Large-scale LDA Problems on GPUs , 2019, HPDC.
[43] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.
[44] François Chollet,et al. Xception: Deep Learning with Depthwise Separable Convolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Thierry Moreau,et al. Relay: A High-Level IR for Deep Learning , 2019, ArXiv.
[46] Yuan Yu,et al. TensorFlow: A system for large-scale machine learning , 2016, OSDI.
[47] Uday Bondhugula,et al. PolyMage: Automatic Optimization for Image Processing Pipelines , 2015, ASPLOS.
[48] Zheng Zhang,et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems , 2015, ArXiv.
[49] C. D. Gelatt,et al. Optimization by Simulated Annealing , 1983, Science.
[50] Shoaib Kamil,et al. The tensor algebra compiler , 2017, Proc. ACM Program. Lang..
[51] Gu-Yeon Wei,et al. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).
[52] Shih-Fu Chang,et al. An Exploration of Parameter Redundancy in Deep Networks with Circulant Projections , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[53] Sanjay V. Rajopadhye,et al. Multi-level tiling: M for the price of one , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[54] Jing Li,et al. Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network , 2017, FPGA.
[55] Elizabeth R. Jessup,et al. Automating the generation of composed linear algebra kernels , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.
[56] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).
[57] Andrew Lavin,et al. Fast Algorithms for Convolutional Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[58] Ali Farhadi,et al. You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[59] Torsten Hoefler,et al. FBLAS: Streaming Linear Algebra on FPGA , 2019, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
[60] Abien Fred Agarap. Deep Learning using Rectified Linear Units (ReLU) , 2018, ArXiv.
[61] Yun Liang,et al. SpWA: An Efficient Sparse Winograd Convolutional Neural Networks Accelerator on FPGAs , 2018, 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).
[62] Alexander Heinecke,et al. Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.