Improving the Programmability of GPU Architectures
[1] Tor M. Aamodt,et al. Thread block compaction for efficient SIMT control flow , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[2] Wen-mei W. Hwu,et al. CUDA-Lite: Reducing GPU Programming Complexity , 2008, LCPC.
[3] Henk Corporaal,et al. The boat hull model: enabling performance prediction for parallel computing prior to code development , 2012, CF '12.
[4] Derek L. Schuff,et al. Multicore-aware reuse distance analysis , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).
[5] Ken Kennedy,et al. Automatic translation of FORTRAN programs to vector form , 1987, TOPLAS.
[6] Kleanthis Psarris,et al. The I Test: An Improved Dependence Test for Automatic Parallelization and Vectorization , 1991, IEEE Trans. Parallel Distributed Syst.
[7] Michael Wolfe,et al. Implementing the PGI Accelerator model , 2010, GPGPU-3.
[8] Henk Corporaal,et al. Skeleton-based automatic parallelization of image processing algorithms for GPUs , 2011, 2011 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation.
[9] Paul Feautrier,et al. Dataflow analysis of array and scalar references , 1991, International Journal of Parallel Programming.
[10] Adam Betts,et al. GPUVerify: a verifier for GPU kernels , 2012, OOPSLA '12.
[11] Herbert Kuchen,et al. Data Parallel Skeletons for GPU Clusters and Multi-GPU Systems , 2011, PARCO.
[12] Tarek S. Abdelrahman,et al. hiCUDA: High-Level GPGPU Programming , 2011, IEEE Transactions on Parallel and Distributed Systems.
[13] Henk Corporaal,et al. A detailed GPU cache model based on reuse distance theory , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[14] Murray Cole,et al. Algorithmic Skeletons: Structured Management of Parallel Computation , 1989 .
[15] Keshav Pingali,et al. The tao of parallelism in algorithms , 2011, PLDI '11.
[16] Henk Corporaal,et al. Algorithmic species revisited: A program code classification based on array references , 2013, 2013 IEEE 6th International Workshop on Multi-/Many-core Computing Systems (MuCoCoS).
[17] Uday Bondhugula,et al. A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.
[18] Henk Corporaal,et al. The boat hull model: adapting the roofline model to enable performance prediction for parallel computing , 2012, PPoPP '12.
[19] Tia Newhall,et al. Chestnut: a GPU programming language for non-experts , 2012, PMAM '12.
[20] Engin Ipek,et al. Core fusion: accommodating software diversity in chip multiprocessors , 2007, ISCA '07.
[21] Henk Corporaal,et al. Analyzing CUDA’s Compiler through the Visualization of Decoded GPU Binaries , 2012 .
[22] Yao Zhang,et al. Parallel Computing Experiences with CUDA , 2008, IEEE Micro.
[23] Jürgen Teich,et al. Generating GPU Code from a High-Level Representation for Image Processing Kernels , 2010, Euro-Par Workshops.
[24] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[25] Pierre Boulet,et al. Array-OL Revisited, Multidimensional Intensive Signal Processing Specification , 2007 .
[26] Michael Stumm,et al. Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors , 2007, EuroSys '07.
[27] Michael F. P. O'Boyle,et al. Portable compiler optimisation across embedded programs and microarchitectures using machine learning , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[28] Erik Lindholm,et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.
[29] Rudy Lauwereins,et al. Architecture exploration for a reconfigurable architecture template , 2005, IEEE Design & Test of Computers.
[30] R.H. Dennard,et al. Design Of Ion-implanted MOSFET's with Very Small Physical Dimensions , 1974, Proceedings of the IEEE.
[31] Monica S. Lam,et al. A data locality optimizing algorithm , 1991, PLDI '91.
[32] Da Wang,et al. Optimizing Sparse Matrix Vector Multiplication Using Cache Blocking Method on Fermi GPU , 2012, 2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing.
[33] Sergei Gorlatch,et al. SkelCL - A Portable Skeleton Library for High-Level GPU Programming , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[34] Jie Shen,et al. Performance Gaps between OpenMP and OpenCL for Multi-core CPUs , 2012, 2012 41st International Conference on Parallel Processing Workshops.
[35] Rob van Nieuwpoort,et al. Evaluating multi-core platforms for HPC data-intensive kernels , 2009, CF '09.
[36] Samuel H. Fuller,et al. Computing Performance: Game Over or Next Level? , 2011, Computer.
[37] Michael Garland,et al. Understanding throughput-oriented architectures , 2010, Commun. ACM.
[38] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[39] Sven Verdoolaege,et al. Polyhedral Process Networks , 2010, Handbook of Signal Processing Systems.
[40] Sally A. McKee,et al. Hitting the memory wall: implications of the obvious , 1995, CARN.
[41] Herb Sutter,et al. The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software , 2013 .
[42] Kurt Keutzer,et al. A design pattern language for engineering (parallel) software: merging the PLPP and OPL projects , 2010, ParaPLoP '10.
[43] Tor M. Aamodt,et al. A first-order fine-grained multithreaded throughput model , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.
[44] Albert Cohen,et al. Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time , 2007, International Symposium on Code Generation and Optimization (CGO'07).
[45] Henk Corporaal,et al. Roofline-aware DVFS for GPUs , 2014, ADAPT '14.
[46] Wouter Caarls,et al. Automated Design of Application-Specific Smart Camera Architectures , 2008 .
[47] Sven Verdoolaege,et al. isl: An Integer Set Library for the Polyhedral Model , 2010, ICMS.
[48] Kevin Skadron,et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.
[49] R. Dolbeau,et al. HMPP™: A Hybrid Multi-core Parallel Programming Environment , 2007 .
[50] François Irigoin,et al. Exact versus Approximate Array Region Analyses , 1996, LCPC.
[51] Christoph W. Kessler,et al. Adaptive Implementation Selection in the SkePU Skeleton Programming Library , 2013, APPT.
[52] Henk Corporaal,et al. Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification , 2013, APPT.
[53] Stéphane Mancini,et al. Automatic generation of a parallel tile processing unit for algorithms with non-affine array references , 2008, IFMT '08.
[54] Christoph W. Kessler,et al. SkePU: a multi-backend skeleton programming library for multi-GPU systems , 2010, HLPP '10.
[55] Albert Cohen,et al. PENCIL: Towards a Platform-Neutral Compute Intermediate Language for DSLs , 2013, HiPC 2013.
[56] David Patterson,et al. The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges , 2009 .
[57] Yutao Zhong,et al. Predicting whole-program locality through reuse distance analysis , 2003, PLDI.
[58] Dragan Bosnacki,et al. Improving GPU Sparse Matrix-Vector Multiplication for Probabilistic Model Checking , 2012, SPIN.
[59] Michael F. P. O'Boyle,et al. Milepost GCC: Machine Learning Enabled Self-tuning Compiler , 2011, International Journal of Parallel Programming.
[60] Stephen A. Jarvis,et al. An investigation of the performance portability of OpenCL , 2013, J. Parallel Distributed Comput.
[61] Onur Mutlu,et al. Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[62] Scott B. Baden,et al. Mint: realizing CUDA performance in 3D stencil methods with annotated C , 2011, ICS '11.
[63] James Demmel,et al. A View of the Parallel Computing Landscape , 2009 .
[64] Jack J. Dongarra,et al. A Note on Auto-tuning GEMM for GPUs , 2009, ICCS.
[65] Kai Li,et al. Thread scheduling for cache locality , 1996, ASPLOS VII.
[66] Hideya Iwasaki,et al. A Skeletal Parallel Framework with Fusion Optimizer for GPGPU Programming , 2009, APLAS.
[67] Michael Bedford Taylor,et al. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse , 2012, DAC Design Automation Conference 2012.
[68] David Black-Schaffer,et al. Modeling performance variation due to cache sharing , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).
[69] Mahmut T. Kandemir,et al. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.
[70] Dominik Grewe,et al. Automatically generating and tuning GPU code for sparse matrix-vector multiplication from a high-level representation , 2011, GPGPU-4.
[71] Timothy G. Mattson,et al. Patterns for parallel programming , 2004 .
[72] Ken Kennedy,et al. Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution , 1993, LCPC.
[73] Vikram Bhatt,et al. The GreenDroid Mobile Application Processor: An Architecture for Silicon's Dark Future , 2011, IEEE Micro.
[74] Liwen Chang,et al. Optimization and architecture effects on GPU computing workload performance , 2012, 2012 Innovative Parallel Computing (InPar).
[75] Jianbin Fang,et al. A Comprehensive Performance Comparison of CUDA and OpenCL , 2011, 2011 International Conference on Parallel Processing.
[76] J. Ramanujam,et al. Automatic C-to-CUDA Code Generation for Affine Programs , 2010, CC.
[77] G.E. Moore,et al. Cramming More Components Onto Integrated Circuits , 1998, Proceedings of the IEEE.
[78] Coniferous softwood. GENERAL TERMS , 2003 .
[79] Krste Asanovic,et al. Convergence and scalarization for data-parallel architectures , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[80] Jürgen Teich,et al. Generating Device-specific GPU Code for Local Operators in Medical Imaging , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[81] Andreas Moshovos,et al. Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[82] Ronan Keryell,et al. Par4All: From Convex Array Regions to Heterogeneous Computing , 2012, HiPEAC 2012.
[83] Nam Sung Kim,et al. GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.
[84] David Patterson. The trouble with multi-core , 2010, IEEE Spectrum.
[85] Karthikeyan Sankaralingam,et al. Dark Silicon and the End of Multicore Scaling , 2012, IEEE Micro.
[86] Paul H. J. Kelly,et al. Deriving Efficient Data Movement from Decoupled Access/Execute Specifications , 2008, HiPEAC.
[87] Rudolf Eigenmann,et al. OpenMPC: Extended OpenMP Programming and Tuning for GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[88] Lei Jiang,et al. Die Stacking (3D) Microarchitecture , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[89] Sudhakar Yalamanchili,et al. Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[90] Mike O'Connor,et al. Cache-Conscious Wavefront Scheduling , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[91] Henk Corporaal,et al. High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs , 2011, GPGPU-4.
[92] Richard J. Enbody,et al. On the mathematics of caching , 2003 .
[93] Hyesoon Kim,et al. An integrated GPU power and performance model , 2010, ISCA.
[94] Berna L. Massingill,et al. A Pattern Language for Parallel Application Programming , 1999 .
[95] Henk Corporaal,et al. GPU-CC: a reconfigurable GPU architecture with communicating cores , 2013, M-SCOPES.
[96] Kristof Beyls,et al. Reuse Distance as a Metric for Cache Behavior , 2001 .
[97] Christian Terboven,et al. OpenACC - First Experiences with Real-World Applications , 2012, Euro-Par.
[98] C. Cascaval,et al. Calculating stack distances efficiently , 2003, MSP '02.
[99] Yifan He,et al. OpenCL code generation for low energy wide SIMD architectures with explicit datapath , 2013, 2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS).
[100] Tao Tang,et al. Cache Miss Analysis for GPU Programs Based on Stack Distance Profile , 2011, 2011 31st International Conference on Distributed Computing Systems.
[101] Vladimir Vlassov,et al. Locality-Aware Task Scheduling and Data Distribution on NUMA Systems , 2013, IWOMP.
[102] William Pugh,et al. The Omega test: A fast and practical integer programming algorithm for dependence analysis , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).
[103] David F. Bacon,et al. Compiler transformations for high-performance computing , 1994, CSUR.
[104] Samuel Williams,et al. Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.
[105] Henk Corporaal,et al. Fast Hough Transform on GPUs: Exploration of Algorithm Trade-Offs , 2011, ACIVS.
[106] Arun Parakh,et al. Performance Estimation of GPUs with Cache , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[107] Francky Catthoor,et al. Polyhedral parallel code generation for CUDA , 2013, TACO.
[108] William J. Dally,et al. GPUs and the Future of Parallel Computing , 2011, IEEE Micro.
[109] Wu-chun Feng,et al. The Green500 List: Encouraging Sustainable Supercomputing , 2007, Computer.
[110] William Gropp,et al. An adaptive performance modeling tool for GPU architectures , 2010, PPoPP '10.
[111] Wen-mei W. Hwu,et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .
[112] Mahmut T. Kandemir,et al. A hyperplane based approach for optimizing spatial locality in loop nests , 1998, ICS '98.
[113] Henk Corporaal,et al. A modular and parameterisable classification of algorithms , 2011 .
[114] I. Kontaxakis. Contribution to Image Segmentation and Integral Image Coding , 2010 .
[115] Feng Liu,et al. Dynamically managed data for CPU-GPU architectures , 2012, CGO '12.
[116] Alain Darte. On the Complexity of Loop Fusion , 2000, Parallel Comput.
[117] Mehdi Amini,et al. Beyond Do Loops: Data Transfer Generation with Convex Array Regions , 2012, LCPC.
[118] Cédric Bastoul,et al. Predictive Modeling in a Polyhedral Optimization Space , 2011, International Symposium on Code Generation and Optimization (CGO 2011).
[119] Uday Bondhugula,et al. Combined iterative and model-driven optimization in an automatic parallelization framework , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[120] Henk Corporaal,et al. Future of GPGPU micro-architectural parameters , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[121] Timothy G. Mattson,et al. A Pattern Language for Parallel Application Programs (Research Note) , 2000, Euro-Par.
[122] Sven Verdoolaege,et al. Polyhedral Extraction Tool , 2012 .
[123] Henk Corporaal,et al. GPU-Vote: A Framework for Accelerating Voting Algorithms on GPU , 2012, Euro-Par.
[124] William J. Dally,et al. A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors , 2012, TOCS.
[125] Ahmad Khonsari,et al. Dynamic warp resizing: Analysis and benefits in high-performance SIMT , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).
[126] Albert Cohen,et al. Putting Automatic Polyhedral Compilation for GPGPU to Work , 2011 .
[127] H. Corporaal,et al. Algorithmic skeletons for stream programming in embedded heterogeneous parallel image processing applications , 2003, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[128] Michaël Krajecki,et al. Source-to-Source Code Translator: OpenMP C to CUDA , 2011, 2011 IEEE International Conference on High Performance Computing and Communications.
[129] Michael F. P. O'Boyle,et al. A large-scale cross-architecture evaluation of thread-coarsening , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[130] Stephen W. Poole,et al. An idiom-finding tool for increasing productivity of accelerators , 2011, ICS '11.
[131] Bart Kienhuis,et al. KPN2GPU: an approach for discovery and exploitation of fine-grain data parallelism in process networks , 2011, CARN.
[132] Paul H. J. Kelly,et al. Design and Performance of the OP2 Library for Unstructured Mesh Applications , 2011, Euro-Par Workshops.
[133] Karthikeyan Sankaralingam,et al. Dynamically Specialized Datapaths for energy efficient computing , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[134] Alejandro Duran,et al. OmpSs: a Proposal for Programming Heterogeneous Multi-Core Architectures , 2011, Parallel Process. Lett.
[135] Mahmut T. Kandemir,et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[136] Henk Corporaal,et al. Introducing 'Bones': a parallelizing source-to-source compiler based on algorithmic skeletons , 2012, GPGPU-5.
[137] Alan Jay Smith,et al. Evaluating Associativity in CPU Caches , 1989, IEEE Trans. Computers.
[138] Lifan Xu,et al. Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).