Improving the Programmability of GPU Architectures

[1]  Tor M. Aamodt,et al.  Thread block compaction for efficient SIMT control flow , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[2]  Wen-mei W. Hwu,et al.  CUDA-Lite: Reducing GPU Programming Complexity , 2008, LCPC.

[3]  Henk Corporaal,et al.  The boat hull model: enabling performance prediction for parallel computing prior to code development , 2012, CF '12.

[4]  Derek L. Schuff,et al.  Multicore-aware reuse distance analysis , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).

[5]  Ken Kennedy,et al.  Automatic translation of FORTRAN programs to vector form , 1987, TOPL.

[6]  Kleanthis Psarris,et al.  The I Test: An Improved Dependence Test for Automatic Parallelization and Vectorization , 1991, IEEE Trans. Parallel Distributed Syst.

[7]  Michael Wolfe,et al.  Implementing the PGI Accelerator model , 2010, GPGPU-3.

[8]  Henk Corporaal,et al.  Skeleton-based automatic parallelization of image processing algorithms for GPUs , 2011, 2011 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation.

[9]  Paul Feautrier,et al.  Dataflow analysis of array and scalar references , 1991, International Journal of Parallel Programming.

[10]  Adam Betts,et al.  GPUVerify: a verifier for GPU kernels , 2012, OOPSLA '12.

[11]  Herbert Kuchen,et al.  Data Parallel Skeletons for GPU Clusters and Multi-GPU Systems , 2011, PARCO.

[12]  Tarek S. Abdelrahman,et al.  hiCUDA: High-Level GPGPU Programming , 2011, IEEE Transactions on Parallel and Distributed Systems.

[13]  Henk Corporaal,et al.  A detailed GPU cache model based on reuse distance theory , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[14]  Murray Cole,et al.  Algorithmic Skeletons: Structured Management of Parallel Computation , 1989 .

[15]  Keshav Pingali,et al.  The tao of parallelism in algorithms , 2011, PLDI '11.

[16]  Henk Corporaal,et al.  Algorithmic species revisited: A program code classification based on array references , 2013, 2013 IEEE 6th International Workshop on Multi-/Many-core Computing Systems (MuCoCoS).

[17]  Uday Bondhugula,et al.  A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.

[18]  Henk Corporaal,et al.  The boat hull model: adapting the roofline model to enable performance prediction for parallel computing , 2012, PPoPP '12.

[19]  Tia Newhall,et al.  Chestnut: a GPU programming language for non-experts , 2012, PMAM '12.

[20]  Engin Ipek,et al.  Core fusion: accommodating software diversity in chip multiprocessors , 2007, ISCA '07.

[21]  Henk Corporaal,et al.  Analyzing CUDA’s Compiler through the Visualization of Decoded GPU Binaries , 2012 .

[22]  Yao Zhang,et al.  Parallel Computing Experiences with CUDA , 2008, IEEE Micro.

[23]  Jürgen Teich,et al.  Generating GPU Code from a High-Level Representation for Image Processing Kernels , 2010, Euro-Par Workshops.

[24]  Tor M. Aamodt,et al.  Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[25]  Pierre Boulet,et al.  Array-OL Revisited, Multidimensional Intensive Signal Processing Specification , 2007 .

[26]  Michael Stumm,et al.  Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors , 2007, EuroSys '07.

[27]  Michael F. P. O'Boyle,et al.  Portable compiler optimisation across embedded programs and microarchitectures using machine learning , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[28]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[29]  Rudy Lauwereins,et al.  Architecture exploration for a reconfigurable architecture template , 2005, IEEE Design & Test of Computers.

[30]  R.H. Dennard,et al.  Design Of Ion-implanted MOSFET's with Very Small Physical Dimensions , 1974, Proceedings of the IEEE.

[31]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[32]  Da Wang,et al.  Optimizing Sparse Matrix Vector Multiplication Using Cache Blocking Method on Fermi GPU , 2012, 2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing.

[33]  Sergei Gorlatch,et al.  SkelCL - A Portable Skeleton Library for High-Level GPU Programming , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[34]  Jie Shen,et al.  Performance Gaps between OpenMP and OpenCL for Multi-core CPUs , 2012, 2012 41st International Conference on Parallel Processing Workshops.

[35]  Rob van Nieuwpoort,et al.  Evaluating multi-core platforms for HPC data-intensive kernels , 2009, CF '09.

[36]  Samuel H. Fuller,et al.  Computing Performance: Game Over or Next Level? , 2011, Computer.

[37]  Michael Garland,et al.  Understanding throughput-oriented architectures , 2010, Commun. ACM.

[38]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[39]  Sven Verdoolaege,et al.  Polyhedral Process Networks , 2010, Handbook of Signal Processing Systems.

[40]  Sally A. McKee,et al.  Hitting the memory wall: implications of the obvious , 1995, CARN.

[41]  Herb Sutter,et al.  The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software , 2013 .

[42]  Kurt Keutzer,et al.  A design pattern language for engineering (parallel) software: merging the PLPP and OPL projects , 2010, ParaPLoP '10.

[43]  Tor M. Aamodt,et al.  A first-order fine-grained multithreaded throughput model , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[44]  Albert Cohen,et al.  Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time , 2007, International Symposium on Code Generation and Optimization (CGO'07).

[45]  Henk Corporaal,et al.  Roofline-aware DVFS for GPUs , 2014, ADAPT '14.

[46]  Wouter Caarls,et al.  Automated Design of Application-Specific Smart Camera Architectures , 2008 .

[47]  Sven Verdoolaege,et al.  isl: An Integer Set Library for the Polyhedral Model , 2010, ICMS.

[48]  Kevin Skadron,et al.  Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.

[49]  R. Dolbeau,et al.  HMPP™: A Hybrid Multi-core Parallel Programming Environment , 2007 .

[50]  François Irigoin,et al.  Exact versus Approximate Array Region Analyses , 1996, LCPC.

[51]  Christoph W. Kessler,et al.  Adaptive Implementation Selection in the SkePU Skeleton Programming Library , 2013, APPT.

[52]  Henk Corporaal,et al.  Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification , 2013, APPT.

[53]  Stéphane Mancini,et al.  Automatic generation of a parallel tile processing unit for algorithms with non-affine array references , 2008, IFMT '08.

[54]  Christoph W. Kessler,et al.  SkePU: a multi-backend skeleton programming library for multi-GPU systems , 2010, HLPP '10.

[55]  Albert Cohen,et al.  PENCIL: Towards a Platform-Neutral Compute Intermediate Language for DSLs , 2013, HiPC 2013.

[56]  David Patterson,et al.  The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges , 2009 .

[57]  Yutao Zhong,et al.  Predicting whole-program locality through reuse distance analysis , 2003, PLDI.

[58]  Dragan Bosnacki,et al.  Improving GPU Sparse Matrix-Vector Multiplication for Probabilistic Model Checking , 2012, SPIN.

[59]  Michael F. P. O'Boyle,et al.  Milepost GCC: Machine Learning Enabled Self-tuning Compiler , 2011, International Journal of Parallel Programming.

[60]  Stephen A. Jarvis,et al.  An investigation of the performance portability of OpenCL , 2013, J. Parallel Distributed Comput.

[61]  Onur Mutlu,et al.  Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[62]  Scott B. Baden,et al.  Mint: realizing CUDA performance in 3D stencil methods with annotated C , 2011, ICS '11.

[63]  James Demmel,et al.  the Parallel Computing Landscape , 2022 .

[64]  Jack J. Dongarra,et al.  A Note on Auto-tuning GEMM for GPUs , 2009, ICCS.

[65]  Kai Li,et al.  Thread scheduling for cache locality , 1996, ASPLOS VII.

[66]  Hideya Iwasaki,et al.  A Skeletal Parallel Framework with Fusion Optimizer for GPGPU Programming , 2009, APLAS.

[67]  Michael Bedford Taylor,et al.  Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse , 2012, DAC Design Automation Conference 2012.

[68]  David Black-Schaffer,et al.  Modeling performance variation due to cache sharing , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[69]  Mahmut T. Kandemir,et al.  OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.

[70]  Dominik Grewe,et al.  Automatically generating and tuning GPU code for sparse matrix-vector multiplication from a high-level representation , 2011, GPGPU-4.

[71]  Timothy G. Mattson,et al.  Patterns for parallel programming , 2004 .

[72]  Ken Kennedy,et al.  Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution , 1993, LCPC.

[73]  Vikram Bhatt,et al.  The GreenDroid Mobile Application Processor: An Architecture for Silicon's Dark Future , 2011, IEEE Micro.

[74]  Liwen Chang,et al.  Optimization and architecture effects on GPU computing workload performance , 2012, 2012 Innovative Parallel Computing (InPar).

[75]  Jianbin Fang,et al.  A Comprehensive Performance Comparison of CUDA and OpenCL , 2011, 2011 International Conference on Parallel Processing.

[76]  J. Ramanujam,et al.  Automatic C-to-CUDA Code Generation for Affine Programs , 2010, CC.

[77]  G.E. Moore,et al.  Cramming More Components Onto Integrated Circuits , 1998, Proceedings of the IEEE.

[79]  Krste Asanovic,et al.  Convergence and scalarization for data-parallel architectures , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[80]  Jürgen Teich,et al.  Generating Device-specific GPU Code for Local Operators in Medical Imaging , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[81]  Andreas Moshovos,et al.  Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).

[82]  Ronan Keryell,et al.  Par4All: From Convex Array Regions to Heterogeneous Computing , 2012, HiPEAC 2012.

[83]  Nam Sung Kim,et al.  GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.

[84]  David Patterson.  The trouble with multi-core , 2010, IEEE Spectrum.

[85]  Karthikeyan Sankaralingam,et al.  Dark Silicon and the End of Multicore Scaling , 2012, IEEE Micro.

[86]  Paul H. J. Kelly,et al.  Deriving Efficient Data Movement from Decoupled Access/Execute Specifications , 2008, HiPEAC.

[87]  Rudolf Eigenmann,et al.  OpenMPC: Extended OpenMP Programming and Tuning for GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[88]  Lei Jiang,et al.  Die Stacking (3D) Microarchitecture , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[89]  Sudhakar Yalamanchili,et al.  Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[90]  Mike O'Connor,et al.  Cache-Conscious Wavefront Scheduling , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[91]  Henk Corporaal,et al.  High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs , 2011, GPGPU-4.

[92]  Richard J. Enbody,et al.  On the mathematics of caching , 2003 .

[93]  Hyesoon Kim,et al.  An integrated GPU power and performance model , 2010, ISCA.

[94]  Berna L. Massingill,et al.  A Pattern Language for Parallel Application Programming , 1999 .

[95]  Henk Corporaal,et al.  GPU-CC: a reconfigurable GPU architecture with communicating cores , 2013, M-SCOPES.

[96]  Kristof Beyls,et al.  Reuse Distance as a Metric for Cache Behavior , 2001 .

[97]  Christian Terboven,et al.  OpenACC - First Experiences with Real-World Applications , 2012, Euro-Par.

[98]  C. Cascaval,et al.  Calculating stack distances efficiently , 2003, MSP '02.

[99]  Yifan He,et al.  OpenCL code generation for low energy wide SIMD architectures with explicit datapath , 2013, 2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS).

[100]  Tao Tang,et al.  Cache Miss Analysis for GPU Programs Based on Stack Distance Profile , 2011, 2011 31st International Conference on Distributed Computing Systems.

[101]  Vladimir Vlassov,et al.  Locality-Aware Task Scheduling and Data Distribution on NUMA Systems , 2013, IWOMP.

[102]  William Pugh,et al.  The Omega test: A fast and practical integer programming algorithm for dependence analysis , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[103]  David F. Bacon,et al.  Compiler transformations for high-performance computing , 1994, CSUR.

[104]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[105]  Henk Corporaal,et al.  Fast Hough Transform on GPUs: Exploration of Algorithm Trade-Offs , 2011, ACIVS.

[106]  Arun Parakh,et al.  Performance Estimation of GPUs with Cache , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.

[107]  Francky Catthoor,et al.  Polyhedral parallel code generation for CUDA , 2013, TACO.

[108]  William J. Dally,et al.  GPUs and the Future of Parallel Computing , 2011, IEEE Micro.

[109]  Wu-chun Feng,et al.  The Green500 List: Encouraging Sustainable Supercomputing , 2007, Computer.

[110]  William Gropp,et al.  An adaptive performance modeling tool for GPU architectures , 2010, PPoPP '10.

[111]  Wen-mei W. Hwu,et al.  Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .

[112]  Mahmut T. Kandemir,et al.  A hyperplane based approach for optimizing spatial locality in loop nests , 1998, ICS '98.

[113]  Henk Corporaal,et al.  A modular and parameterisable classification of algorithms , 2011 .

[114]  I. Kontaxakis.  Contribution to Image Segmentation and Integral Image Coding , 2010 .

[115]  Feng Liu,et al.  Dynamically managed data for CPU-GPU architectures , 2012, CGO '12.

[116]  Alain Darte.  On the Complexity of Loop Fusion , 2000, Parallel Comput.

[117]  Mehdi Amini,et al.  Beyond Do Loops: Data Transfer Generation with Convex Array Regions , 2012, LCPC.

[118]  Cédric Bastoul,et al.  Predictive Modeling in a Polyhedral Optimization Space , 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[119]  Uday Bondhugula,et al.  Combined iterative and model-driven optimization in an automatic parallelization framework , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[120]  Henk Corporaal,et al.  Future of GPGPU micro-architectural parameters , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[121]  Timothy G. Mattson,et al.  A Pattern Language for Parallel Application Programs (Research Note) , 2000, Euro-Par.

[122]  Sven Verdoolaege,et al.  Polyhedral Extraction Tool , 2012 .

[123]  Henk Corporaal,et al.  GPU-Vote: A Framework for Accelerating Voting Algorithms on GPU , 2012, Euro-Par.

[124]  William J. Dally,et al.  A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors , 2012, TOCS.

[125]  Ahmad Khonsari,et al.  Dynamic warp resizing: Analysis and benefits in high-performance SIMT , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[126]  Albert Cohen,et al.  Putting Automatic Polyhedral Compilation for GPGPU to Work , 2011 .

[127]  H. Corporaal,et al.  Algorithmic skeletons for stream programming in embedded heterogeneous parallel image processing applications , 2003, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[128]  Michaël Krajecki,et al.  Source-to-Source Code Translator: OpenMP C to CUDA , 2011, 2011 IEEE International Conference on High Performance Computing and Communications.

[129]  Michael F. P. O'Boyle,et al.  A large-scale cross-architecture evaluation of thread-coarsening , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[130]  Stephen W. Poole,et al.  An idiom-finding tool for increasing productivity of accelerators , 2011, ICS '11.

[131]  Bart Kienhuis,et al.  KPN2GPU: an approach for discovery and exploitation of fine-grain data parallelism in process networks , 2011, CARN.

[132]  Paul H. J. Kelly,et al.  Design and Performance of the OP2 Library for Unstructured Mesh Applications , 2011, Euro-Par Workshops.

[133]  Karthikeyan Sankaralingam,et al.  Dynamically Specialized Datapaths for energy efficient computing , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[134]  Alejandro Duran,et al.  OmpSs: a Proposal for Programming Heterogeneous Multi-Core Architectures , 2011, Parallel Process. Lett.

[135]  Mahmut T. Kandemir,et al.  Neither more nor less: Optimizing thread-level parallelism for GPGPUs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[136]  Henk Corporaal,et al.  Introducing 'Bones': a parallelizing source-to-source compiler based on algorithmic skeletons , 2012, GPGPU-5.

[137]  Alan Jay Smith,et al.  Evaluating Associativity in CPU Caches , 1989, IEEE Trans. Computers.

[138]  Lifan Xu,et al.  Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).