Tiled QR Decomposition and Its Optimization on CPU and GPU Computing System

Heterogeneous computing systems come in many forms, and the combined CPU and GPU system is among the most useful. On such a system we run QR decomposition, which factors a real matrix into the product of two matrices: an orthogonal matrix Q and an upper triangular matrix R. Tiled QR decomposition is a parallelized version of this algorithm. Because the computing devices are heterogeneous and communication between them is costly, the main issue in tiled QR decomposition is deciding which device each tile is assigned to. The goal of this study is to optimize the tile distribution and the tiled QR decomposition operation mathematically for a given system. We select the main computing device for the main steps of the algorithm, optimize the number of devices used, and optimize the tile distribution among the devices using a distribution guide array. Our evaluation indicates that the method scales well and that the optimization process improves tiled QR decomposition performance.
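The abstract does not say how the distribution guide array is built, so the following is only a minimal sketch of one plausible reading: a short array of device indices in which each device appears in proportion to an assumed relative throughput, applied cyclically to tile columns. The function names `build_guide_array` and `assign_tile_columns`, the `weights` parameter, and the proportional rule itself are all assumptions made for illustration and are not taken from the paper.

```python
def build_guide_array(weights, length):
    """Build a list of device indices of the given length.

    Hypothetical helper: device d appears roughly in proportion to
    weights[d], an assumed relative throughput for that device.
    """
    if length <= 0 or not weights or any(w <= 0 for w in weights):
        raise ValueError("need a positive length and positive weights")
    total = sum(weights)
    counts = [0] * len(weights)
    guide = []
    for step in range(1, length + 1):
        # Pick the device whose assigned share lags its target share most.
        deficits = [w / total * step - c for w, c in zip(weights, counts)]
        device = max(range(len(weights)), key=lambda i: deficits[i])
        counts[device] += 1
        guide.append(device)
    return guide


def assign_tile_columns(num_tile_cols, guide):
    """Map each tile column to a device by cycling through the guide array."""
    return [guide[j % len(guide)] for j in range(num_tile_cols)]


if __name__ == "__main__":
    # Assumption: device 0 is about three times as fast as device 1.
    guide = build_guide_array([3, 1], 4)
    print(guide)
    print(assign_tile_columns(8, guide))
```

With weights `[3, 1]` and a guide length of 4, three of the four guide slots go to device 0 and one to device 1, so cycling the guide over the tile columns keeps roughly that ratio for any matrix width. A real system would derive the weights from measured kernel and transfer times rather than fixed constants.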
