Accelerating Matrix Operations with Improved Deeply Pipelined Vector Reduction

Many scientific and engineering applications involve matrix operations, in which vector reduction is a common step. If the core operator of the reduction is deeply pipelined, which is usually the case, dependencies between the input data elements cause data hazards. To tackle this problem, we propose a new reduction method with low latency and high pipeline utilization. The performance of the proposed design is evaluated for both single-data-set and multiple-data-set scenarios. Further, QR decomposition is used as a case study to demonstrate how the proposed method can accelerate the decomposition's execution. We implement the design on an FPGA and compare the results with those of other methods.
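
To make the data hazard concrete, the following is a minimal sketch of a toy cycle model, not the reduction method proposed here. It assumes an adder with `depth` pipeline stages that can accept at most one new operation per cycle, and it counts the cycles needed to issue a run of accumulating additions when they form one dependency chain versus several independent chains fed round-robin. The names `issue_cycles`, `depth`, and `n_chains` are hypothetical, and the final merge of the partial sums is deliberately left out.

```python
def issue_cycles(n_adds, depth, n_chains):
    """Cycles needed to issue n_adds accumulating additions.

    Toy model (an assumption, not the paper's design): additions are
    assigned round-robin to n_chains independent running sums, and an
    addition may issue only after the previous addition on the same
    chain has left the depth-stage pipeline.
    """
    ready = [0] * n_chains          # earliest cycle each chain may issue again
    cycle = 0
    for i in range(n_adds):
        chain = i % n_chains
        cycle = max(cycle, ready[chain])   # stall on the data hazard
        ready[chain] = cycle + depth       # result returns depth cycles later
        cycle += 1                         # at most one issue per cycle
    return cycle


def partial_sums(values, n_chains):
    """Round-robin partial sums; adding them up gives the full reduction."""
    sums = [0] * n_chains
    for i, v in enumerate(values):
        sums[i % n_chains] += v
    return sums


if __name__ == "__main__":
    depth = 4
    print(issue_cycles(8, depth, n_chains=1))      # single dependent chain
    print(issue_cycles(8, depth, n_chains=depth))  # one chain per pipeline stage
    print(sum(partial_sums(range(8), depth)) == sum(range(8)))
```

In this model a single chain leaves the pipeline mostly idle, while keeping as many independent chains as there are pipeline stages lets one addition issue every cycle. What remains is the cost of combining the partial sums, which is the part a dedicated reduction circuit has to handle well.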
