Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data
[1] Shoaib Kamil et al. Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code, 2018, 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[2] Tim Zerrell et al. Stripe: Tensor Compilation via the Nested Polyhedral Model, 2019, ArXiv.
[3] Cédric Bastoul et al. Code generation in the polyhedral model is easier than you think, 2004, Proceedings of the 13th International Conference on Parallel Architecture and Compilation Techniques (PACT 2004).
[4] Francky Catthoor et al. Polyhedral parallel code generation for CUDA, 2013, ACM Trans. Archit. Code Optim.
[5] Hariharan Sandanagobalane et al. Diesel: DSL for linear algebra and neural net computations on GPUs, 2018, MAPL@PLDI.
[6] Sanjay V. Rajopadhye et al. Multi-level tiling: M for the price of one, 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[7] Uday Bondhugula et al. A model for fusion and code motion in an automatic parallelizing compiler, 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[8] Albert Cohen et al. A polyhedral compilation framework for loops with dynamic data-dependent bounds, 2018, CC.
[9] Jan Kautz et al. Local Laplacian filters: edge-aware image processing with a Laplacian pyramid, 2011, ACM Trans. Graph.
[10] Frédo Durand et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, 2013, PLDI 2013.
[11] Albert Cohen et al. Polyhedral AST Generation Is More Than Scanning Polyhedra, 2015, ACM Trans. Program. Lang. Syst.
[12] Albert Cohen et al. Sub-polyhedral scheduling using (unit-)two-variable-per-inequality polyhedra, 2013, POPL.
[13] Li Fei-Fei et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.
[14] Paul Feautrier et al. Some efficient solutions to the affine scheduling problem. I. One-dimensional time, 1992, International Journal of Parallel Programming.
[15] Mary W. Hall et al. Loop and data transformations for sparse matrix code, 2015, PLDI.
[16] Haichen Shen et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, 2018, OSDI.
[17] Uday Bondhugula et al. An effective fusion and tile size model for optimizing image processing pipelines, 2018, PPoPP.
[18] Uday Bondhugula et al. A practical automatic polyhedral parallelizer and locality optimizer, 2008, PLDI '08.
[19] Jian Sun et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[20] Sven Verdoolaege et al. isl: An Integer Set Library for the Polyhedral Model, 2010, ICMS.
[21] Jan Bartovsky et al. GPU implementation of linear morphological openings with arbitrary angle, 2012, Journal of Real-Time Image Processing.
[22] Uday Bondhugula et al. Effective automatic parallelization of stencil computations, 2007, PLDI '07.
[23] William Pugh et al. Static analysis of upper and lower bounds on dependences and parallelism, 1994, ACM Trans. Program. Lang. Syst.
[24] Chau-Wen Tseng et al. Improving data locality with loop transformations, 1996, ACM Trans. Program. Lang. Syst.
[25] John L. Henning. SPEC CPU2000: Measuring CPU Performance in the New Millennium, 2000, Computer.
[26] David R. O'Hallaron et al. Large-scale simulation of elastic wave propagation in heterogeneous media on parallel computers, 1998.
[27] Christian Lengauer et al. Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation, 2012, Parallel Process. Lett.
[28] Jürgen Teich et al. From Loop Fusion to Kernel Fusion: A Domain-Specific Approach to Locality Optimization, 2019, 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[29] Jun Yang et al. FusionStitching: Deep Fusion and Code Generation for Tensorflow Computations on GPUs, 2018, ArXiv.
[30] Catherine Mills Olschanowsky et al. Transforming loop chains via macro dataflow graphs, 2018, CGO.
[31] P. Sadayappan et al. Resource conscious reuse-driven tiling for GPUs, 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).
[32] Jiawen Chen et al. Real-time edge-aware image processing with the bilateral grid, 2007, ACM Trans. Graph.
[33] Jingling Xue et al. Loop Tiling for Parallelism, 2000, Kluwer International Series in Engineering and Computer Science.
[34] Uday Bondhugula et al. Diamond Tiling: Tiling Techniques to Maximize Parallelism for Stencil Computations, 2017, IEEE Transactions on Parallel and Distributed Systems.
[35] Xing Zhou et al. Hierarchical overlapped tiling, 2012, CGO '12.
[36] Uday Bondhugula et al. MLIR: A Compiler Infrastructure for the End of Moore's Law, 2020, ArXiv.
[37] Christopher G. Harris et al. A Combined Corner and Edge Detector, 1988, Alvey Vision Conference.
[38] Vivek Sarkar et al. Modeling the conflicting demands of parallelism and Temporal/Spatial locality in affine scheduling, 2018, CC.
[39] Sanjay V. Rajopadhye et al. Parameterized loop tiling, 2012, ACM Trans. Program. Lang. Syst.
[40] Frédo Durand et al. Learning to optimize Halide with tree search and random programs, 2019, ACM Trans. Graph.
[41] Sriram Krishnamoorthy et al. Parametric multi-level tiling of imperfectly nested loops, 2009, ICS.
[42] Frédo Durand et al. Fast Local Laplacian Filters, 2014, ACM Trans. Graph.
[43] Jing Xia et al. DaVinci: A Scalable Architecture for Neural Network Computing, 2019, 2019 IEEE Hot Chips 31 Symposium (HCS).
[44] Robert J. Harrison et al. On fusing recursive traversals of K-d trees, 2016, CC.
[45] Richard Veras et al. When polyhedral transformations meet SIMD code generation, 2013, PLDI.
[46] Pierre Kornprobst et al. Bilateral Filtering, 2009.
[47] Shoaib Kamil et al. OpenTuner: An extensible framework for program autotuning, 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[48] Albert Cohen et al. Polyhedral Code Generation in the Real World, 2006, CC.
[49] Pen-Chung Yew et al. Tile size selection revisited, 2013, ACM Trans. Archit. Code Optim.
[50] Uday Bondhugula et al. PolyMage: Automatic Optimization for Image Processing Pipelines, 2015, ASPLOS.
[51] François Irigoin et al. Supernode partitioning, 1988, POPL '88.
[52] Albert Cohen et al. Hybrid Hexagonal/Classical Tiling for GPUs, 2014, CGO '14.
[53] Albert Cohen et al. Split tiling for GPUs: automatic parallelization using trapezoidal tiles, 2013, GPGPU@ASPLOS.
[54] Ken Kennedy et al. Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution, 1993, LCPC.
[55] Louis-Noël Pouchet et al. Model-driven transformations for multi- and many-core CPUs, 2019, PLDI.
[56] Paul Feautrier et al. Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time, 1992, International Journal of Parallel Programming.
[57] Pen-Chung Yew et al. Revisiting loop fusion in the polyhedral framework, 2014, PPoPP '14.