DNNFusion: accelerating deep neural networks execution with advanced operator fusion
Wei Niu | Jiexiong Guan | Yanzhi Wang | Gagan Agrawal | Bin Ren