Development methodologies for GPU and cluster of GPUs

This book chapter presents several development methodologies for obtaining efficient code in classical scientific applications. These methodologies are based on feedback from several research works involving GPUs, either in a single machine or in a cluster of machines. Indeed, our past collaborations with industry have shown that, in their economic context, companies can adopt a parallel technology only if its implementation and maintenance costs remain small relative to the potential benefits (performance, accuracy, ...). In such contexts, GPU programming is therefore still regarded with some reserve, owing to its specific field of applicability (the SIMD/SIMT model) and its comparatively high programming and maintenance complexity. In the academic domain, things are somewhat different, but studies on efficiently integrating GPU computations into multi-core clusters, with maximal overlapping of computations with communications and/or other computations, remain rare. For these reasons, the main aim of this chapter is to propose general programming patterns, as simple as possible, that can be followed or adapted in practical implementations of parallel scientific applications; a minimal sketch of the overlap pattern is given below. In a third part, we also present a prospective analysis together with a particular programming tool intended to ease the programming of multi-core GPU clusters.
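As a minimal, self-contained illustration of the computation/communication overlap mentioned above (a generic sketch, not the chapter's own code), the following CUDA program splits an array into chunks and assigns one stream per chunk, so that host-device transfers of one chunk can proceed concurrently with kernel execution on another. The kernel `scale`, the sizes `N` and `NSTREAMS`, and the chunking scheme are hypothetical placeholders chosen for the example.

```cuda
// Sketch: overlapping host<->device transfers with kernel execution
// using CUDA streams. All names and sizes are illustrative assumptions.
#include <cuda_runtime.h>
#include <stdio.h>

#define N        (1 << 20)   // total number of elements (assumption)
#define NSTREAMS 4           // number of concurrent streams (assumption)

__global__ void scale(float *d, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] *= a;
}

int main(void)
{
    float *h, *d;
    cudaStream_t s[NSTREAMS];
    int chunk = N / NSTREAMS;

    // Pinned host memory is required for truly asynchronous copies.
    cudaMallocHost(&h, N * sizeof(float));
    cudaMalloc(&d, N * sizeof(float));
    for (int i = 0; i < N; i++) h[i] = 1.0f;

    for (int k = 0; k < NSTREAMS; k++)
        cudaStreamCreate(&s[k]);

    // Each stream copies its chunk in, processes it, and copies it back.
    // Transfers issued in one stream overlap with kernels in the others.
    for (int k = 0; k < NSTREAMS; k++) {
        int off = k * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d + off, chunk, 2.0f);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();

    printf("h[0] = %f\n", h[0]);  // expect 2.0

    for (int k = 0; k < NSTREAMS; k++)
        cudaStreamDestroy(s[k]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

The same pipelining idea extends to clusters, where asynchronous MPI communications can play the role that the asynchronous copies play here; pinned host memory and per-stream ordering are what make the overlap effective on a single node.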
