Micro-architectural support for improving synchronization and efficiency of SIMD execution on GPUs