Divergence Analysis and Optimizations
Fernando Magno Quintão Pereira | Wagner Meira Jr | Bruno Coutinho | Diogo Sampaio
[1] Cristina Cifuentes,et al. User-Input Dependence Analysis via Graph Reachability , 2008, 2008 Eighth IEEE International Working Conference on Source Code Analysis and Manipulation.
[2] Jaewook Shin. Introducing Control Flow into Vectorized Code , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).
[3] Mark N. Wegman,et al. Efficiently computing static single assignment form and the control dependence graph , 1991, TOPLAS.
[4] Sudhakar Yalamanchili,et al. Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[5] Dorota H. Kieronska,et al. Formal Specification of Parallel SIMD Execution , 1996, Theor. Comput. Sci.
[6] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[7] Ken Kennedy,et al. Loop distribution with arbitrary control flow , 1990, Proceedings SUPERCOMPUTING '90.
[8] Xipeng Shen,et al. On-the-fly elimination of dynamic irregularities for GPU computing , 2011, ASPLOS XVI.
[9] Michael Stepp,et al. Equality saturation: a new approach to optimization , 2009, POPL '09.
[10] J. Nickolls. Graphics and Computing GPUs , 2022 .
[11] Guy E. Blelloch,et al. Vcode: a data-parallel intermediate language , 1990, [1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation.
[12] Michael Garland,et al. Understanding throughput-oriented architectures , 2010, Commun. ACM.
[13] Rolf Wanka,et al. Efficient oblivious parallel sorting on the MasPar MP-1 , 1997, Proceedings of the Thirtieth Hawaii International Conference on System Sciences.
[14] Etienne Morel,et al. Global optimization by suppression of partial redundancies , 1979, CACM.
[15] Andrew W. Appel,et al. Modern Compiler Implementation in Java , 1997 .
[16] Sudhakar Yalamanchili,et al. A characterization and analysis of PTX kernels , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[17] Luc Bougé,et al. Control structures for data-parallel SIMD languages: semantics and implementation , 1992, Future Gener. Comput. Syst..
[18] David W. Binkley,et al. Program slicing , 2008, 2008 Frontiers of Software Maintenance.
[19] Philippas Tsigas,et al. GPU-Quicksort: A practical Quicksort algorithm for graphics processors , 2010, JEA.
[20] Arthur B. Maccabe,et al. The program dependence web: a representation supporting control-, data-, and demand-driven interpretation of imperative languages , 1990, PLDI '90.
[21] Satoshi Matsuoka,et al. Massive supercomputing coping with heterogeneity of modern accelerators , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[22] William J. Dally,et al. The GPU Computing Era , 2010, IEEE Micro.
[23] Pradeep Dubey,et al. Larrabee: A Many-Core x86 Architecture for Visual Computing , 2009, IEEE Micro.
[24] D. V. Bhaskar Rao,et al. Wavefront Array Processor: Language, Architecture, and Applications , 1982, IEEE Transactions on Computers.
[25] Fernando Magno Quintão Pereira,et al. Performance Debugging of GPGPU Applications with the Divergence Map , 2010, 2010 22nd International Symposium on Computer Architecture and High Performance Computing.
[26] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[27] Joe D. Warren,et al. The program dependence graph and its use in optimization , 1987, TOPLAS.
[28] Xipeng Shen,et al. Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping , 2010, ICS '10.
[29] Yoichi Muraoka,et al. TRANQUIL: a language for an array processing computer , 1969, AFIPS '69 (Spring).
[30] Bobby Bodenheimer,et al. Synthesis and evaluation of linear motion transitions , 2008, TOG.
[31] Mark J. Harris,et al. Parallel Prefix Sum (Scan) with CUDA , 2011 .
[32] M S Waterman,et al. Identification of common molecular subsequences. , 1981, Journal of molecular biology.
[33] Michael J. Flynn,et al. Some Computer Organizations and Their Effectiveness , 1972, IEEE Transactions on Computers.
[34] Duncan H. Lawrie,et al. Glypnir—a programming language for Illiac IV , 1975, Commun. ACM.
[35] Vikram S. Adve,et al. LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004.
[36] William Gropp,et al. An adaptive performance modeling tool for GPU architectures , 2010, PPoPP '10.
[37] Ahmed Sameh,et al. The Illiac IV system , 1972 .
[38] Norman P. Jouppi,et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[39] Yao Zhang,et al. Dynamic Detection of Uniform and Affine Vectors in GPGPU Computations , 2009, Euro-Par Workshops.
[40] Ronald H. Perrott,et al. A Language for Array and Vector Processors , 1979, TOPLAS.
[41] Ronan Keryell,et al. POMP or How to Design a Massively Parallel Machine with Small Developments , 1991, PARLE.
[42] Mike Murphy,et al. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs , 2010, CGO '10.
[43] D. Mount. Bioinformatics: Sequence and Genome Analysis , 2001 .