FPGA acceleration of large irregular dataflow graphs is often limited by the long-tail distribution of parallelism on fine-grained overlay dataflow architectures. In this paper, we show how to overcome these limitations by exploiting criticality information along compute paths, both statically during graph pre-processing and dynamically at runtime. We statically reassociate high-fanin dataflow chains to provide faster routes for late-arriving inputs. We also perform fanout decomposition and selective node replication to distribute serialization costs across multiple PEs. Additionally, we modify the dataflow firing rule in hardware to prefer critical nodes when multiple nodes are ready for evaluation. Together, these transformations shorten the tail of the parallelism profile for these large-scale graphs. Across a range of dataflow benchmarks extracted from sparse LU factorization, we demonstrate up to a 2.5× (mean 1.21×) improvement when using the static pre-processing alone, a 2.4× (mean 1.17×) improvement when using only the dynamic optimizations, and an overall 2.9× (mean 1.39×) improvement when both static and dynamic optimizations are enabled. These improvements come on top of the 3--10× speedups over CPU implementations that the overlay achieves without our transformations.
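The abstract does not spell out how the static reassociation works, so the following is only an illustrative sketch of one way a high-fanin associative chain could be rebuilt around input arrival times. It is written in Python; the arrival-time values, the unit operator latency, and the greedy Huffman-like construction (combine the two earliest-ready operands first, so late arrivals sit near the root and traverse fewer operators) are all assumptions for illustration, not the paper's actual algorithm.

```python
import heapq
from itertools import count

def reassociate_chain(arrival_times, op_latency=1):
    """Rebuild a high-fanin associative reduction as a combining tree
    in which late-arriving inputs end up close to the output.

    arrival_times: dict mapping input name -> estimated arrival time (assumed known)
    Returns (tree, completion_time), where tree is a nested tuple of operands.
    """
    tie = count()  # tiebreaker so the heap never compares node objects directly
    heap = [(t, next(tie), name) for name, t in arrival_times.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Greedily combine the two operands that become ready earliest;
        # the partial result is ready after both inputs arrive plus one op.
        ta, _, a = heapq.heappop(heap)
        tb, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (max(ta, tb) + op_latency, next(tie), (a, b)))
    completion, _, root = heap[0]
    return root, completion

# Hypothetical example: 'x7' arrives late, so it is combined last and
# sees only one operator on its path to the chain output.
tree, done = reassociate_chain({'x0': 0, 'x1': 0, 'x2': 1, 'x3': 1, 'x7': 6})
print(tree, done)
```

Under these assumptions the early inputs are summed while waiting for the late one, which is the "faster route for late-arriving inputs" effect the abstract describes.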