Using the loop chain abstraction to schedule across loops in existing code
Catherine Mills Olschanowsky | Michelle Mills Strout | Stephen M. Guzik | Jordan Riley | Eddie C. Davis | Ian J. Bertolacci