An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads

Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly-structured communication workloads from messages propagating along graph edges. We can statially expose this structure to traffic compilers and optimization tools to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimization eliminates the need for complex, runtime NoC hardware and enables lightweight, scalable NoCs. We perform load balancing, placement, fanout routing, and fine-grained synchronization to optimize our workloads for large networks up to 2025 parallel elements for BSP model and 25 parallel elements for Token Dataflow. This allows us to demonstrate speedups between 1.2× and 22× (3.5× mean), area reductions (number of Processing Elements) between 3× and 15× (9× mean) and dynamic energy savings between 2× and 3.5× (2.7× mean) over a range of real-world graph applications in the BSP compute model. We deliver speedups of 0.5–13× (geomean 3.6×) for Sparse Direct Matrix Solve (Token Dataflow compute model) applied to a range of sparse matrices when using a high-quality placement algorithm. We expect such traffic optimization tools and techniques to become an essential part of the NoC application-mapping flow.

[1]  Donald Yeung,et al.  Experience with fine-grain synchronization in MIMD machines for preconditioned conjugate gradient , 1993, PPOPP '93.

[2]  Simon W. Moore,et al.  Low-latency virtual-channel routers for on-chip networks , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[3]  Hugo Liu,et al.  ConceptNet — A Practical Commonsense Reasoning Tool-Kit , 2004 .

[4]  Li-Shiuan Peh,et al.  A Statistical Traffic Model for On-Chip Interconnection Networks , 2006, 14th IEEE International Symposium on Modeling, Analysis, and Simulation.

[5]  Partha Pratim Pande,et al.  Performance evaluation and design trade-offs for network-on-chip interconnect architectures , 2005, IEEE Transactions on Computers.

[6]  David E. Culler,et al.  Monsoon: an explicit token-store architecture , 1998, ISCA '98.

[7]  Nachiket Kapre,et al.  Packet Switched vs. Time Multiplexed FPGA Overlay Networks , 2006, 2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.

[8]  A. DeHon,et al.  Parallelizing sparse Matrix Solve for SPICE circuit simulation using FPGAs , 2009, 2009 International Conference on Field-Programmable Technology.

[9]  André DeHon,et al.  Floating-point sparse matrix-vector multiply for FPGAs , 2005, FPGA '05.

[10]  Lionel M. Ni,et al.  A survey of wormhole routing techniques in direct networks , 1993, Computer.

[11]  Andrew B. Kahng,et al.  Improved algorithms for hypergraph bipartitioning , 2000, ASP-DAC '00.

[12]  Jeong-Gun Lee,et al.  Implications of Rent's Rule for NoC Design and Its Fault-Tolerance , 2007, First International Symposium on Networks-on-Chip (NOCS'07).

[13]  Radu Marculescu,et al.  On-chip traffic modeling and synthesis for MPEG-2 video applications , 2004, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[14]  Nachiket Kapre,et al.  GraphStep: A System Architecture for Sparse-Graph Algorithms , 2006, 2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.

[15]  Chau-Wen Tseng,et al.  Compiler optimizations for eliminating barrier synchronization , 1995, PPOPP '95.

[16]  Timothy Mark Pinkston,et al.  A methodology for designing efficient on-chip interconnects on well-behaved communication patterns , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[17]  Wei Tsang Ooi,et al.  Approximate VCCs: a new characterization of multimedia workloads for system-level MpSoC design , 2005, Proceedings. 42nd Design Automation Conference, 2005..

[18]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[19]  Daniel D. Gajski,et al.  High ― Level Synthesis: Introduction to Chip and System Design , 1992 .

[20]  William J. Dally,et al.  Design tradeoffs for tiled CMP on-chip networks , 2006, ICS '06.

[21]  Vaughn Betz,et al.  Using cluster-based logic blocks and timing-driven packing to improve FPGA speed and density , 1999, FPGA '99.

[22]  Timothy A. Davis,et al.  Algorithm 907 , 2010 .