Micro-architectural support for improving synchronization and efficiency of SIMD execution on GPUs

GPUs dedicate the majority of their transistor budgets to compute units rather than control logic. As a result, they can achieve excellent data-parallel power/performance. Given the continual demand for performance and power efficiency, GPUs have become today's compute accelerators for many application domains. The general-purpose computing community has been developing strategies to move a broader class of applications to these powerful devices, and the underlying GPU architecture has been adapted to run a limited class of general-purpose computations that appear across a range of applications. Many applications have already been ported to GPU platforms to take advantage of the data-parallel performance that GPUs afford, but barriers remain to migrating a broader class of applications. GPUs were originally designed to run 3-D graphics and are highly optimized for graphics workloads. Because graphics workloads possess a high degree of uniformity in their execution, GPU architectures are optimized for efficient uniform execution. GPUs achieve high performance with data-parallel applications that possess regular control flow (i.e., predictable loops) and data access patterns that can effectively exploit high off-chip memory bandwidth. However, many real-world general-purpose applications differ from graphics workloads: they come with large input sets exhibiting irregular access and synchronization patterns, and they possess varying computational granularity and irregular control flow. The current requirements for uniformity and predictability present barriers to moving a broader range of applications to GPUs. We believe that if GPUs are to become mainstream computing devices, it is necessary to relax some of these constraints. Only then can a wider variety of applications exploit the computational power of GPUs. One critical barrier in non-uniform data-parallel applications is the need to synchronize between threads.
Fine-grained synchronization is needed to support shared data access, especially when faced with irregular access and communication patterns. This dissertation presents a new approach to enhance the efficiency and scalability of GPU synchronization. The proposed scheme enables applications that work on shared data to communicate effectively at finer levels of granularity. To achieve this goal, we propose a new synchronization approach called Hierarchical Queuing Locks (HQL). HQL is a novel hardware-based synchronization mechanism that uses resources efficiently through execution blocking and hierarchical queuing. To provide a queue-based locking mechanism, HQL extends current GPU L1 and L2 cache management protocols by adding a synchronization protocol. Integrating HQL's synchronization protocol simplifies synchronization, but adds a level of complexity to the cache management protocol. Given this added complexity, as part of this dissertation we provide a formal verification of the proposed HQL synchronization protocol. To evaluate the benefits of HQL, we start by studying a set of micro-benchmarks that represent highly irregular applications requiring frequent synchronization. We additionally evaluate macro-benchmarks that utilize synchronization. We report on both the performance benefits and the savings in terms of instructions executed. Building upon the efficient fine-grained synchronization support provided by HQL, we explore Scalar Waving (SW) and Simultaneous Scalar and SIMD group Waving (SSSW) architectures to further improve the efficiency of SIMD execution on GPUs. These two mechanisms attempt to reduce the amount of redundant computation performed by the threads in a SIMD group. SW and SSSW improve SIMD efficiency for both irregular and regular applications. We motivate this work by reporting the percentage of redundant computations present in a range of workloads.
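
To make the notion of redundant computation in a SIMD group more concrete, the following minimal sketch counts how many dynamic SIMD instructions in a small trace have identical source operands in every lane, so that all lanes would compute the same result. The trace format, the four-lane width, and the names `is_uniform` and `redundant_fraction` are illustrative assumptions; the abstract does not give the dissertation's own definition or measurement method.

```python
from typing import List, Sequence


def is_uniform(lane_values: Sequence[int]) -> bool:
    """Return True if every lane holds the same value for this operand."""
    return len(set(lane_values)) <= 1


def redundant_fraction(trace: List[List[Sequence[int]]]) -> float:
    """Fraction of SIMD instructions whose source operands are all lane-uniform.

    Assumed trace format (hypothetical): one entry per dynamic instruction,
    each entry being a list of source operands, each operand being the list
    of per-lane values.
    """
    if not trace:
        return 0.0
    redundant = sum(
        1 for operands in trace if all(is_uniform(op) for op in operands)
    )
    return redundant / len(trace)


# Hypothetical 4-lane trace, not taken from any benchmark.
trace = [
    [[7, 7, 7, 7], [3, 3, 3, 3]],  # all operands uniform: lanes repeat the same work
    [[0, 1, 2, 3], [4, 4, 4, 4]],  # first operand differs per lane
    [[5, 5, 5, 5]],                # single uniform operand
    [[1, 2, 1, 2], [9, 9, 9, 9]],  # first operand differs per lane
]

print(redundant_fraction(trace))  # 0.5
```

Under this assumed definition, an instruction counted as redundant could in principle be executed once for the whole group rather than once per lane.
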
We then quantitatively evaluate the benefits of the SW and SSSW architectures using programs taken from four different benchmark suites. The impact of this dissertation is the design of architectural features that can make the benefits of GPU computing available to a much wider range of applications. These kinds of enhancements can only further accelerate the adoption of GPUs as a first-class computing device.
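
As a rough illustration of why blocking and queuing can reduce the number of instructions executed compared with spinning, the toy model below counts lock-related operations under contention. It is not the HQL protocol: it assumes all threads request the lock at cycle 0, the lock is granted in request order, each critical section lasts a fixed number of cycles, and a spinning thread issues one atomic attempt per cycle. The function names and parameters are hypothetical, and the model ignores backoff, memory latency, and the cache hierarchy.

```python
def spin_attempts(num_threads: int, hold_cycles: int) -> int:
    """Total atomic attempts when every waiting thread retries once per cycle."""
    attempts = 0
    for position in range(num_threads):
        # The thread at this position acquires the lock once all threads
        # ahead of it have finished their critical sections.
        acquire_cycle = position * hold_cycles
        # It attempts at cycles 0..acquire_cycle; only the last attempt succeeds.
        attempts += acquire_cycle + 1
    return attempts


def queued_requests(num_threads: int) -> int:
    """Total requests when each thread enqueues once and then blocks."""
    return num_threads


# Assumed example: 32 contending threads, 10-cycle critical sections.
print(spin_attempts(32, 10))  # 4992
print(queued_requests(32))    # 32
```

In this model the spinning cost grows quadratically with the number of contending threads, while the queued cost grows linearly.
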
