Hardware–software optimizations of reconfigurable multi-core processors for floating-point computations of large sparse matrices

State-of-the-art field-programmable gate array (FPGA) technologies provide exciting opportunities to develop more flexible, less expensive, and higher-performance floating-point computing platforms for embedded systems. To harness the full power of FPGAs and to bring them to more system designers, we investigate the unique advantages and optimization opportunities, in both software and hardware, offered by multi-core processors on a programmable chip (MPoPCs). In this paper, we present hardware customization and dynamic software scheduling solutions for LU factorization of large sparse matrices on MPoPCs developed in-house. Theoretical analysis is provided to guide the design. We present implementation results on an Altera Stratix III FPGA for five benchmark matrices of size up to 7,917 × 7,917. Hardware customization alone reduces execution time by up to 17.22%, and the integrated hardware–software optimization improves speedup by an average of 60.30%.
