A Reconfigurable Architecture for QR Decomposition Using a Hybrid Approach

QR decomposition is widely used in signal processing applications to solve linear inverse problems. However, it is computationally expensive, and sequential implementations fail to meet the requirements of many time-sensitive applications. The Householder transformation and the Givens rotation are the most popular techniques for performing QR decomposition, and each has its own strengths and weaknesses. The Householder transformation lends itself to efficient sequential implementation; however, its inherent data dependencies complicate parallelization. The structure of the Givens rotation, on the other hand, provides many opportunities for concurrency, but its performance is typically limited by the availability of computing resources. We propose a deeply pipelined reconfigurable architecture that can be dynamically configured to perform either approach in a manner that takes advantage of the strengths of each. At runtime, the input matrix is first partitioned into numerous sub-matrices. Our architecture then performs parallel Householder transformations on the sub-matrices in the same column block, followed by parallel Givens rotations to annihilate the remaining individual off-diagonal elements. Analysis of our design indicates the potential to achieve a performance of 10.5 GFLOPS, with speedups of up to 1.46×, 1.15×, and 13.75× compared to the MKL implementation, a recent FPGA design, and a Matlab solution, respectively.
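The two-phase scheme described in the abstract can be illustrated in software. The following is a minimal sketch, not the proposed architecture: it assumes a single column block (a tall matrix split only by rows), runs sequentially rather than in parallel, and computes only the R factor. The function names (`householder_triangularize`, `givens_annihilate`, `hybrid_qr_r`, `gram`) and the `block_rows` parameter are hypothetical and chosen for illustration.

```python
import math

def householder_triangularize(B):
    """Reduce one sub-matrix B (list of rows) to upper-trapezoidal form in place."""
    rows, cols = len(B), len(B[0])
    for k in range(min(rows - 1, cols)):
        x = [B[i][k] for i in range(k, rows)]
        norm = math.sqrt(sum(t * t for t in x))
        if norm == 0.0:
            continue  # column is already zero at and below the diagonal
        alpha = -math.copysign(norm, x[0])  # sign chosen to avoid cancellation
        v = x[:]
        v[0] -= alpha                       # Householder vector v = x - alpha*e1
        vtv = sum(t * t for t in v)
        for j in range(k + 1, cols):        # apply H = I - 2 v v^T / (v^T v)
            dot = sum(v[i] * B[k + i][j] for i in range(len(v)))
            scale = 2.0 * dot / vtv
            for i in range(len(v)):
                B[k + i][j] -= scale * v[i]
        B[k][k] = alpha                     # H maps x to alpha*e1
        for i in range(1, len(v)):
            B[k + i][k] = 0.0
    return B

def givens_annihilate(A):
    """Zero every remaining entry below the diagonal with Givens rotations."""
    m, n = len(A), len(A[0])
    for j in range(min(m, n)):
        for i in range(j + 1, m):
            b = A[i][j]
            if b == 0.0:
                continue  # already annihilated by the Householder phase
            a = A[j][j]
            r = math.hypot(a, b)
            c, s = a / r, b / r
            for col in range(j, n):         # rotate rows j and i
                top, bot = A[j][col], A[i][col]
                A[j][col] = c * top + s * bot
                A[i][col] = -s * top + c * bot
            A[i][j] = 0.0
    return A

def hybrid_qr_r(A, block_rows):
    """Return the m x n matrix whose upper triangle is R (assumed row partitioning)."""
    work = []
    for start in range(0, len(A), block_rows):
        block = [row[:] for row in A[start:start + block_rows]]
        work.extend(householder_triangularize(block))  # phase 1: per sub-matrix
    return givens_annihilate(work)                     # phase 2: leftover entries

def gram(M):
    """Compute M^T M, which any orthogonal transformation of M leaves unchanged."""
    n = len(M[0])
    return [[sum(row[i] * row[j] for row in M) for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    A = [[2.0, -1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 2.0, 4.0],
         [1.0, 1.0, 1.0], [3.0, 0.0, -2.0], [1.0, 2.0, 2.0]]
    R = hybrid_qr_r(A, block_rows=2)
    for row in R:
        print(["%.4f" % value for value in row])
```

In this sketch the Householder phase touches each sub-matrix independently, so those calls could run concurrently; the Givens phase then only has to handle the entries that the per-block reflections could not reach. Comparing `gram(A)` with `gram(R)` is one way to check the result, since R differs from A only by an orthogonal factor.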
